Computational Learning Theory



Computational learning theory: Introduction

Questions:
- Is it possible to identify classes of learning problems that are inherently easy or difficult?
- Can we characterize the number of training examples necessary or sufficient to assure successful learning?
- How is this number affected if the learner is allowed to pose queries to the trainer?
- Can we characterize the number of mistakes that a learner will make before learning the target function?
- Can we characterize the inherent computational complexity of classes of learning problems?

General answers to all these questions are not yet known.

This chapter covers:
- Sample complexity
- Computational complexity
- Mistake bound

Probably Approximately Correct (PAC) Learning [1]

Problem setting for concept learning:
- X: the set of all possible instances over which target functions may be defined. Training and testing instances are generated from X according to some unknown distribution D. We assume that D is stationary.
- C: the set of target concepts that our learner might be called upon to learn. Each target concept is a Boolean function $c : X \to \{0, 1\}$.
- H: the set of all possible hypotheses.
- Goal: produce a hypothesis $h \in H$ that is an estimate of c.
- Evaluation: the performance of h is measured over new samples drawn randomly according to D.

Error of a hypothesis:

The training error (sample error) of hypothesis h with respect to target concept c and data sample S of size n is

$$\mathrm{error}_S(h) = \frac{1}{n} \sum_{x \in S} \delta[c(x) \neq h(x)]$$

where $\delta[\cdot]$ is 1 when its argument is true and 0 otherwise.

The true error (denoted $\mathrm{error}_D(h)$) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D:

$$\mathrm{error}_D(h) = \Pr_{x \sim D}[c(x) \neq h(x)] = \sum_{x \,:\, h(x) \neq c(x)} D(x)$$

Here D(x) is the probability of presenting x, so $\mathrm{error}_D(h)$ depends strongly on D. We say h is approximately correct if $\mathrm{error}_D(h) \le \varepsilon$.

[Figure: instance space X with the regions covered by c and h and positive (+) and negative (-) examples; the true error is the probability mass of the region where c and h disagree.]
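The difference between the two error measures can be illustrated on a toy problem. The following is a minimal sketch, not part of the original slides: the four-element instance space, the distribution `dist`, the concept `c`, the hypothesis `h`, and the sample are all hypothetical choices made for illustration.

```python
def sample_error(h, c, sample):
    """Fraction of the sample on which h and c disagree (training error)."""
    return sum(h(x) != c(x) for x in sample) / len(sample)


def true_error(h, c, dist):
    """Probability mass, under dist, of the instances where h and c disagree."""
    return sum(p for x, p in dist.items() if h(x) != c(x))


# Hypothetical toy setting: X = {0, 1, 2, 3} with an assumed distribution D.
dist = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}


def c(x):
    """Assumed target concept."""
    return x >= 2


def h(x):
    """Assumed hypothesis; it disagrees with c only on x = 2."""
    return x >= 3


sample = [0, 1, 0, 3]  # a sample that happens not to contain x = 2

train_err = sample_error(h, c, sample)
real_err = true_error(h, c, dist)
```

In this setting h is consistent with the sample, so its training error is zero, while its true error equals D(2). This is why a hypothesis that fits the training data cannot be assumed to have zero true error.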

Probably Approximately Correct (PAC) Learning [2]

We are trying to characterize the number of training examples needed to learn a hypothesis h for which $\mathrm{error}_D(h) = 0$.

Problems:
- There may be multiple hypotheses consistent with the training data, and the learner cannot be certain to pick the one corresponding to the target concept.
- Since the training set is chosen randomly, the true error may not be zero.

To accommodate these difficulties, we weaken the demands on the learner:
- We do not require that the true error be zero; we require only that it be bounded by some constant ε.
- We do not require that the learner succeed for every training sequence; we require only that its failure probability be bounded by some constant δ. Here δ is the confidence parameter.

In short, we require only that the learner probably learns a hypothesis that is approximately correct.

PAC learnability:

Definition: C is PAC-learnable by L using H if, for all $c \in C$, all distributions D over X, all ε such that 0 < ε < 1/2, and all δ such that 0 < δ < 1/2, learner L will, with probability at least (1 - δ), output a hypothesis $h \in H$ such that $\mathrm{error}_D(h) \le \varepsilon$, in time that is polynomial in 1/ε, 1/δ, n, and size(c). Here n is the size of instances in X and size(c) is the encoding length of c.

If L requires some minimum processing time per training example, then for C to be PAC-learnable by L, L must learn from a polynomial number of training examples.

Probably Approximately Correct (PAC) Learning [3]

Sample complexity: the growth in the number of required training examples with problem size. The most limiting factor for the success of a learner is the limited availability of training data.

Consistent learner: a learner is consistent if it outputs hypotheses that perfectly fit the training data, whenever possible.

Our concern: can we bound the true error of h, given the training error of h?

Definition: the version space $VS_{H,D}$ (the hypotheses in H consistent with the training examples D) is said to be ε-exhausted with respect to c and the distribution D if every h in $VS_{H,D}$ has true error less than ε with respect to c and D:

$$\forall h \in VS_{H,D} : \ \mathrm{error}_D(h) < \varepsilon$$

[Figure: hypothesis space H containing six hypotheses labeled (error, r): (0.1, 0.2), (0.2, 0.0), (0.3, 0.4), (0.3, 0.1), (0.1, 0.0), (0.2, 0.3), where r is the training error and error is the true error. The version space $VS_{H,D}$ contains the hypotheses with r = 0.]

Probably Approximately Correct (PAC) Learning [4]

Theorem [Haussler, 1988]: If the hypothesis space H is finite, and D is a sequence of $m \ge 1$ independent random examples of some target concept c, then for any $0 \le \varepsilon \le 1$, the probability that the version space with respect to H and D is not ε-exhausted (with respect to c) is less than or equal to

$$|H| \, e^{-\varepsilon m}$$

This is an important result: it bounds the probability that any consistent learner will output a hypothesis h with $\mathrm{error}(h) \ge \varepsilon$.

We want this probability to be below a specified threshold δ:

$$|H| \, e^{-\varepsilon m} \le \delta$$

Solving the inequality for m gives

$$m \ge \frac{1}{\varepsilon} \left( \ln |H| + \ln \frac{1}{\delta} \right)$$

The learner needs to see at least this many examples. Note that for small m it is possible that $|H| \, e^{-\varepsilon m} > 1$, in which case the bound says nothing.
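The bound and the resulting sample size are easy to evaluate numerically. The sketch below is illustrative only: the function names and the values |H| = 1000, ε = 0.1, δ = 0.05 are assumptions chosen for the example, not taken from the slides.

```python
import math


def version_space_bound(h_size, epsilon, m):
    """Upper bound |H| * exp(-epsilon * m) on the probability that the
    version space is not epsilon-exhausted after m examples."""
    return h_size * math.exp(-epsilon * m)


def sample_complexity(h_size, epsilon, delta):
    """Smallest integer m with m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / epsilon)


# Hypothetical values for illustration.
h_size, epsilon, delta = 1000, 0.1, 0.05

m = sample_complexity(h_size, epsilon, delta)
bound_at_m = version_space_bound(h_size, epsilon, m)        # at most delta
bound_small_m = version_space_bound(h_size, epsilon, 10)    # exceeds 1
```

With only 10 examples the bound is far above 1 and therefore carries no information, which matches the remark that the bound can exceed 1.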

Probably Approximately Correct (PAC) Learning [5]

Example: H is the set of conjunctions of constraints on up to n boolean attributes (n boolean literals). Then $|H| = 3^n$ and

$$m \ge \frac{1}{\varepsilon} \left( \ln 3^n + \ln \frac{1}{\delta} \right) = \frac{1}{\varepsilon} \left( n \ln 3 + \ln \frac{1}{\delta} \right)$$

How about EnjoySport? With H as given in EnjoySport (conjunctive concepts with don't cares), $|H| = 973$ and

$$m \ge \frac{1}{\varepsilon} \left( \ln |H| + \ln \frac{1}{\delta} \right)$$

Example goal: with probability 1 - δ = 95%, every consistent hypothesis has $\mathrm{error}_D(h) < 0.1$:

$$m \ge \frac{1}{0.1} \left( \ln 973 + \ln \frac{1}{0.05} \right) \approx 98.8$$

Training examples:

Example  Sky    Air Temp  Humidity  Wind    Water  Forecast  Enjoy Sport
0        Sunny  Warm      Normal    Strong  Warm   Same      Yes
1        Sunny  Warm      High      Strong  Warm   Same      Yes
2        Rainy  Cold      High      Strong  Warm   Change    No
3        Sunny  Warm      High      Strong  Cool   Change    Yes
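The EnjoySport figures can be reproduced with a short calculation. This sketch assumes the attribute value counts of the standard textbook version of EnjoySport (Sky has three values, the other five attributes have two each); the slides themselves do not list them. Each attribute is either constrained to one of its values or left as a don't care, and one extra hypothesis rejects every instance.

```python
import math

# Assumed number of values per attribute (not stated in the slides).
values_per_attribute = {
    "Sky": 3, "AirTemp": 2, "Humidity": 2,
    "Wind": 2, "Water": 2, "Forecast": 2,
}

# k specific values plus "don't care" per attribute, plus the single
# hypothesis that classifies every instance as negative.
h_size = 1 + math.prod(k + 1 for k in values_per_attribute.values())

epsilon, delta = 0.1, 0.05
m = (math.log(h_size) + math.log(1 / delta)) / epsilon
```

Under these assumptions `h_size` matches the slide's |H| = 973 and `m` rounds to 98.8, so 99 examples suffice for this ε and δ.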

Probably Approximately Correct (PAC) Learning [6]

Unbiased learner

Recall the sample complexity bound:

$$m \ge \frac{1}{\varepsilon} \left( \ln |H| + \ln \frac{1}{\delta} \right)$$

Sample complexity is not always polynomial. Example: for an unbiased learner, $|H| = 2^{|X|}$. Suppose X consists of n booleans (binary-valued attributes). Then $|X| = 2^n$, $|H| = 2^{2^n}$, and

$$m \ge \frac{1}{\varepsilon} \left( 2^n \ln 2 + \ln \frac{1}{\delta} \right)$$

The sample complexity for this H is exponential in n.

Agnostic learner: a learner that makes no assumption that the target concept is representable by H and that simply finds the hypothesis with minimum training error.

How hard is this? The sample complexity is

$$m \ge \frac{1}{2\varepsilon^2} \left( \ln |H| + \ln \frac{1}{\delta} \right)$$

It is derived from the Hoeffding bound:

$$\Pr\left[\mathrm{TrueError}_D(h) > \mathrm{TrainingError}_D(h) + \varepsilon\right] \le e^{-2 m \varepsilon^2}$$
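To get a feel for the two bounds on this slide, they can be evaluated side by side. This is a rough sketch under stated assumptions: the function names are hypothetical, the bounds take ln|H| rather than |H| so that the doubly exponential unbiased hypothesis space does not overflow, and the values ε = 0.1, δ = 0.05, n = 10, and |H| = 973 are chosen only for illustration.

```python
import math


def consistent_bound(ln_h, epsilon, delta):
    """(1/epsilon) * (ln|H| + ln(1/delta)), for a consistent learner."""
    return (ln_h + math.log(1 / delta)) / epsilon


def agnostic_bound(ln_h, epsilon, delta):
    """(1/(2 epsilon^2)) * (ln|H| + ln(1/delta)), for an agnostic learner."""
    return (ln_h + math.log(1 / delta)) / (2 * epsilon ** 2)


def ln_unbiased_h(n):
    """ln|H| for the unbiased learner over n booleans, where |H| = 2^(2^n)."""
    return (2 ** n) * math.log(2)


epsilon, delta = 0.1, 0.05

# Unbiased learner: the bound roughly doubles with every added attribute.
m_unbiased_10 = consistent_bound(ln_unbiased_h(10), epsilon, delta)
m_unbiased_11 = consistent_bound(ln_unbiased_h(11), epsilon, delta)

# Agnostic versus consistent learner on the same finite H (|H| = 973).
m_consistent = consistent_bound(math.log(973), epsilon, delta)
m_agnostic = agnostic_bound(math.log(973), epsilon, delta)
```

For the same H, ε, and δ, the agnostic bound is larger than the consistent-learner bound by a factor of 1/(2ε), which is 5 for ε = 0.1.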

Probably Approximately Correct (PAC) Learning [7]

Drawbacks of the sample complexity bound:
- The bound is not tight: when |H| is large, the bound on the failure probability may be greater than 1.
- The bound cannot be applied when |H| is infinite.

Vapnik-Chervonenkis dimension (VC(H))

The VC dimension measures the complexity of the hypothesis space H, not by the number of distinct hypotheses |H|, but by the number of distinct instances from X that can be completely discriminated using H.

Dichotomies: a dichotomy (concept) of a set S is a partition of S into two subsets $S_1$ and $S_2$.

Shattering: a set of instances S is shattered by hypothesis space H if and only if for every dichotomy (concept) of S, there exists a hypothesis in H consistent with this dichotomy.

Intuition: a rich set of functions shatters a larger instance space.

Vapnik-Chervonenkis Dimension [1]

From Chapter 2, an unbiased hypothesis space is capable of representing every possible concept (dichotomy) defined over the instance space X. In other words, an unbiased hypothesis space H can shatter the instance space X.

What happens if H cannot shatter X, but can shatter some large subset S of X? This is what VC(H) captures.

Vapnik-Chervonenkis dimension (VC(H)): VC(H) of a hypothesis space H defined over the instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets of X can be shattered by H, then $VC(H) = \infty$.

For any finite H, $VC(H) \le \log_2 |H|$, because shattering d instances requires $2^d$ distinct hypotheses.

Vapnik-Chervonenkis Dimension [2]

Example:
- X = R: the set of real numbers.
- H: the set of intervals on the real line of the form a < x < b, where a and b may be any real constants.

What is VC(H)?

Let S = {3.1, 5.7}. Can S be shattered by H? Yes, the following four hypotheses realize all four dichotomies:
- 1 < x < 2 (includes neither instance)
- 1 < x < 4 (includes only 3.1)
- 4 < x < 7 (includes only 5.7)
- 1 < x < 7 (includes both instances)

Let $S = \{x_0, x_1, x_2\}$ with $x_0 < x_1 < x_2$. Can S be shattered by H? No: the dichotomy that includes $x_0$ and $x_2$ but not $x_1$ cannot be represented by any hypothesis in H.

Thus VC(H) = 2.
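The interval example can be checked by brute force. The sketch below is an illustration rather than part of the slides: it enumerates intervals whose endpoints lie in the gaps around the given points (moving an endpoint without crossing a point never changes the labeling, so these candidates are enough) and collects every dichotomy they produce. The function names and the three-point test set are assumptions.

```python
from itertools import combinations


def interval_dichotomies(points):
    """All labelings of `points` realizable by an open interval a < x < b."""
    pts = sorted(points)
    # Candidate endpoints: two values below the minimum, the midpoints
    # between neighbouring points, and one value above the maximum.
    cuts = [pts[0] - 2, pts[0] - 1]
    cuts += [(p + q) / 2 for p, q in zip(pts, pts[1:])]
    cuts.append(pts[-1] + 1)
    return {tuple(a < x < b for x in points) for a, b in combinations(cuts, 2)}


def is_shattered(points):
    """True if intervals realize all 2^|points| dichotomies of `points`."""
    return len(interval_dichotomies(points)) == 2 ** len(points)


two_points = is_shattered([3.1, 5.7])
three_points = is_shattered([1.0, 2.0, 3.0])
missing = (True, False, True) not in interval_dichotomies([1.0, 2.0, 3.0])
```

For the three-point set, the only dichotomy that never appears is the one that includes the two outer points and excludes the middle one, in line with the argument above.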

Vapnik-Chervonenkis Dimension [3]

Example:
- X: the set of instances corresponding to points in the x-y plane.
- H: the set of all linear decision surfaces in the plane (such as the decision function of a perceptron).

What is VC(H)?

Collinear points cannot be shattered. So is VC(H) equal to 2, 3, or more? The definition only requires that some set of a given size be shattered, and a set of three non-collinear points can be shattered, so $VC(H) \ge 3$.

To show VC(H) < d, we must show that no set of size d can be shattered. In this example, no set of size 4 can be shattered, hence VC(H) = 3.

It can be shown that the VC dimension of linear decision surfaces in r-dimensional space is r + 1.

Example:
- X: the set of instances corresponding to points in the x-y plane.
- H: the set of all axis-aligned rectangles in two dimensions.

What is VC(H)?
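For the axis-aligned rectangle question, a brute-force check on concrete point sets can guide the reasoning. This is a sketch under assumptions: instances inside or on the rectangle are taken to be positive, and the diamond-shaped point set is a hypothetical choice. The check uses the fact that any rectangle containing the positive points also contains their bounding box, so it suffices to test whether a negative point falls inside that box.

```python
from itertools import product


def rectangles_shatter(points):
    """True if axis-aligned rectangles realize every dichotomy of `points`."""
    for labels in product([False, True], repeat=len(points)):
        pos = [p for p, keep in zip(points, labels) if keep]
        neg = [p for p, keep in zip(points, labels) if not keep]
        if not pos:
            continue  # a rectangle placed away from all points works
        x_lo, x_hi = min(x for x, _ in pos), max(x for x, _ in pos)
        y_lo, y_hi = min(y for _, y in pos), max(y for _, y in pos)
        # A negative point inside the bounding box of the positives
        # makes this dichotomy impossible for any rectangle.
        if any(x_lo <= x <= x_hi and y_lo <= y <= y_hi for x, y in neg):
            return False
    return True


diamond = [(0, 1), (1, 0), (0, -1), (-1, 0)]   # hypothetical 4-point set
with_center = diamond + [(0, 0)]               # add the center point

diamond_ok = rectangles_shatter(diamond)
center_ok = rectangles_shatter(with_center)
```

The check only covers the sets it is given. It shows that one four-point set is shattered and that this particular five-point set is not, which is consistent with the standard answer VC(H) = 4 but does not by itself prove that no five-point set can be shattered.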

Vapnik-Chervonenkis Dimension [4]

Example:
- X: the set of instances described by exactly three Boolean literals.
- H: conjunctions of up to three Boolean literals.

What is VC(H)?

Represent each instance by a three-bit string corresponding to the literals $l_1$, $l_2$, and $l_3$, and consider:
- Instance 1: 100
- Instance 2: 010
- Instance 3: 001

A hypothesis can be constructed for any desired dichotomy using the following rule: if the dichotomy is to exclude instance i, add the literal $\neg l_i$ to the hypothesis. For example, $\neg l_1 \wedge \neg l_3$ includes only instance 2, and $\neg l_1 \wedge \neg l_2$ includes only instance 3.

This set of three instances is therefore shattered. In general, VC(H) for conjunctions of n Boolean literals is equal to n.

Example: feed-forward neural networks with N free parameters
- The VC dimension of a neural network with linear activation functions is O(N).
- The VC dimension of a neural network with threshold activation functions is O(N log N).
- The VC dimension of a neural network with sigmoid activation functions is $O(N^2)$.
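The construction rule for the three instances 100, 010, and 001 can be verified mechanically. In this minimal sketch a hypothesis is represented only by the indices of its negated literals, which is an assumed encoding that suffices for the rule; the helper names are hypothetical.

```python
from itertools import product

instances = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]


def rule_hypothesis(dichotomy):
    """Indices i for which the rule adds NOT l_i: the excluded instances."""
    return [i for i, included in enumerate(dichotomy) if not included]


def classify(negated, x):
    """Evaluate the conjunction of NOT l_i, for i in `negated`, on x."""
    return all(x[i] == 0 for i in negated)


# The rule must reproduce every one of the 2^3 dichotomies exactly.
shattered = all(
    tuple(classify(rule_hypothesis(d), x) for x in instances) == d
    for d in product([False, True], repeat=len(instances))
)

# Include only instance 2: the rule yields NOT l_1 AND NOT l_3 (indices 0, 2).
only_instance_2 = rule_hypothesis((False, True, False))
```

The rule works because instance i is the only one with bit i set, so adding the negated literal for bit i excludes instance i and no other.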

Mistake Bounds

How many mistakes will the learner make in its predictions before it learns the target concept?

Example: Find-S

Suppose H is the set of conjunctions of up to n Boolean literals and their negations.

Find-S:
- Initialize h to the most specific hypothesis $l_1 \wedge \neg l_1 \wedge l_2 \wedge \neg l_2 \wedge \dots \wedge l_n \wedge \neg l_n$.
- For each positive training instance x, remove from h any literal that is not satisfied by x.
- Output hypothesis h.

How many mistakes before converging to the correct h?
- Once a literal is removed, it is never put back.
- There are no false positives (we started with the most restrictive h), so we only count false negatives.
- The first positive example removes n of the 2n candidate literals.
- Worst case: every remaining literal is also removed, incurring one mistake each.
- Find-S makes at most n + 1 mistakes.
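The mistake count can be observed by running Find-S online, predicting each label before seeing it. The following is a minimal sketch, assuming a literal is encoded as a pair (i, v) that is satisfied when x[i] == v; the target concept $l_1 \wedge \neg l_3$, the choice n = 4, and the presentation order are hypothetical.

```python
from itertools import product


def find_s_online(examples, n):
    """Run Find-S online and count prediction mistakes.

    examples: iterable of (x, label) with x a tuple of n booleans.
    """
    # Most specific hypothesis: every literal together with its negation,
    # which no instance can satisfy.
    h = {(i, v) for i in range(n) for v in (True, False)}
    mistakes = 0
    for x, label in examples:
        prediction = all(x[i] == v for i, v in h)
        if prediction != label:
            mistakes += 1
        if label:
            # Remove every literal that this positive example violates.
            h = {(i, v) for i, v in h if x[i] == v}
    return h, mistakes


def target(x):
    """Hypothetical target concept: l_1 AND (NOT l_3)."""
    return x[0] and not x[2]


n = 4
examples = [(x, target(x)) for x in product([False, True], repeat=n)]
h, mistakes = find_s_online(examples, n)
```

On this sequence the learner ends with exactly the two literals of the target concept and stays within the n + 1 bound. Every mistake is a false negative on a positive example, as the argument above predicts.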
