The PAC Learning Framework – II
Prof. Dan A. Simovici
UMB
Outline
1. Finite Hypothesis Space – The Inconsistent Case
2. Deterministic versus Stochastic Scenario
3. Bayes Error and Noise
Universal Concept Class
Let $X = \{0,1\}^n$ and let $U_n = \mathcal{P}(X)$ be the concept class formed by all subsets of $X$. To guarantee a consistent hypothesis, the hypothesis class must include the concept class, so $|H| \geq |U_n| = 2^{2^n}$. We have
$$m \geq \frac{1}{\epsilon}\left(2^n \log 2 + \log\frac{1}{\delta}\right).$$
The number of examples required by the theorem is exponential in $n$, so PAC learnability does not follow.
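As a quick numeric illustration (a sketch added here, not part of the slides; the $\epsilon$ and $\delta$ values are made up), the bound above can be evaluated directly to see the exponential blow-up in $n$:

import math

def sample_bound(n: int, eps: float = 0.1, delta: float = 0.05) -> int:
    # smallest m with m >= (1/eps) * (2**n * log 2 + log(1/delta))
    return math.ceil((2**n * math.log(2) + math.log(1 / delta)) / eps)

for n in (2, 4, 8, 16):
    print(n, sample_bound(n))  # grows like 2^n / eps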
Finite Hypothesis Space – The Inconsistent Case

Framework
If the concept class is more complex than the hypothesis space, it may be the case that there is no hypothesis consistent with a labeled training sample, that is, no $h_S$ satisfies $\hat{R}(h_S) = 0$. We use the following corollary of Hoeffding's inequality:

Corollary
Let $X_1, \ldots, X_n$ be $n$ independent random variables such that $X_i \in [0,1]$ for $1 \leq i \leq n$, and let $Z_n$ be the random variable defined by
$$Z_n = \frac{1}{n}\sum_{i=1}^{n} X_i.$$
The following inequalities hold:
$$P(Z_n - E(Z_n) \geq \epsilon) \leq e^{-2n\epsilon^2}, \qquad P(Z_n - E(Z_n) \leq -\epsilon) \leq e^{-2n\epsilon^2}.$$
Framework (cont'd)
Recall that $R(h) = E(\hat{R}(h))$. The corollary of Hoeffding's inequality applied to
$$\hat{R}(h) = \frac{1}{m}\,|\{x_i : h(x_i) \neq c(x_i)\}|$$
implies that for any $\epsilon > 0$, any sample $S = (x_1, \ldots, x_m)$ of size $m$, and any hypothesis $h : X \to \{0,1\}$, the following inequalities hold:
$$P(\hat{R}(h) - R(h) \geq \epsilon) \leq e^{-2m\epsilon^2}, \qquad P(\hat{R}(h) - R(h) \leq -\epsilon) \leq e^{-2m\epsilon^2}.$$
Therefore,
$$P(|\hat{R}(h) - R(h)| \geq \epsilon) \leq 2e^{-2m\epsilon^2}$$
and
$$P(|\hat{R}(h) - R(h)| < \epsilon) \geq 1 - 2e^{-2m\epsilon^2}. \tag{1}$$
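The two-sided bound can be checked empirically. The following is a minimal Monte Carlo sketch (my own illustration, not part of the slides): the loss indicators are i.i.d. Bernoulli($r$) variables, so $R(h) = r$, and we compare the observed frequency of deviations $|\hat{R}(h) - R(h)| \geq \epsilon$ against $2e^{-2m\epsilon^2}$; all numeric values are illustrative.

import math, random

def deviation_frequency(r=0.3, m=200, eps=0.05, trials=20000, seed=0):
    # frequency of |R_hat - R| >= eps over many simulated samples of size m
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        r_hat = sum(rng.random() < r for _ in range(m)) / m  # empirical risk
        hits += abs(r_hat - r) >= eps
    return hits / trials

print(deviation_frequency())             # observed frequency, below the bound
print(2 * math.exp(-2 * 200 * 0.05**2))  # Hoeffding bound 2e^{-1} ~ 0.736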
Generalization Bound – Single Hypothesis

Corollary
For a random hypothesis $h : X \to \{0,1\}$ and for any $\delta > 0$, the inequality
$$R(h) \leq \hat{R}(h) + \sqrt{\frac{\log\frac{2}{\delta}}{2m}}$$
holds with probability at least $1 - \delta$.

Proof: Taking $1 - 2e^{-2m\epsilon^2} \geq 1 - \delta$, or $\delta \geq 2e^{-2m\epsilon^2}$, in Inequality (1) we obtain
$$P(|\hat{R}(h) - R(h)| < \epsilon) \geq 1 - \delta$$
for $\epsilon = \sqrt{\frac{\log\frac{2}{\delta}}{2m}}$.

Note: The inequality of the corollary is an inequality involving random variables, not numbers, because $h$ is a randomly chosen hypothesis in $H$.
Tossing a Coin – Example
Let $p$ be the probability that a biased coin lands on heads, and let $h$ be the hypothesis that always guesses tails. The generalization error is then $R(h) = p$, and $\hat{R}(h) = \hat{p}$, where $\hat{p}$ is the empirical probability of heads on a training sample drawn i.i.d. Thus, with probability at least $1 - \delta$, we have
$$|\hat{p} - p| \leq \sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$
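To put numbers on this (my own illustrative sketch; the sample sizes and $\delta$ are arbitrary), the half-width of the resulting confidence interval is easy to compute:

import math

def hoeffding_half_width(m: int, delta: float) -> float:
    # radius sqrt(log(2/delta) / (2m)) of the confidence interval for p
    return math.sqrt(math.log(2 / delta) / (2 * m))

print(hoeffding_half_width(500, 0.05))   # ~0.061
print(hoeffding_half_width(5000, 0.05))  # ~0.019: 10x the flips, ~sqrt(10)x tighter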
Learning Bound – Finite H, Inconsistent Case

Theorem
Let $H$ be a finite hypothesis set. For any $\delta > 0$, the inequality
$$\forall h \in H:\quad R(h) \leq \hat{R}(h) + \sqrt{\frac{\log|H| + \log\frac{2}{\delta}}{2m}}$$
holds with probability at least $1 - \delta$.

Remark: This is a uniform bound (it applies to all hypotheses in $H$).
Proof
Let $H = \{h_1, \ldots, h_{|H|}\}$ be the set of hypotheses. By the union bound we have:
$$P(\exists h \in H : |R(h) - \hat{R}(h)| > \epsilon) = P\left((|R(h_1) - \hat{R}(h_1)| > \epsilon) \vee \cdots \vee (|R(h_{|H|}) - \hat{R}(h_{|H|})| > \epsilon)\right)$$
$$\leq \sum_{h \in H} P(|R(h) - \hat{R}(h)| > \epsilon) \leq 2|H|\,e^{-2m\epsilon^2}.$$
Proof (cont'd)
Thus, we have
$$P(\exists h \in H : |R(h) - \hat{R}(h)| > \epsilon) \leq 2|H|\,e^{-2m\epsilon^2}.$$
Choosing $\delta = 2|H|\,e^{-2m\epsilon^2}$, it follows that $\log\delta = \log 2 + \log|H| - 2m\epsilon^2$, so
$$\epsilon = \sqrt{\frac{\log 2 + \log|H| - \log\delta}{2m}} = \sqrt{\frac{\log|H| + \log\frac{2}{\delta}}{2m}}.$$
With these choices we have
$$P(\exists h \in H : |R(h) - \hat{R}(h)| > \epsilon) \leq \delta,$$
which amounts to the inequality of the theorem:
$$P(\forall h \in H : |R(h) - \hat{R}(h)| < \epsilon) \geq 1 - \delta.$$
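The derived $\epsilon$ is cheap to evaluate. Below is a small sketch (added for illustration; the values of $|H|$, $m$, and $\delta$ are made up) showing how mildly the uniform deviation grows with the size of $H$:

import math

def uniform_eps(h_size: int, m: int, delta: float) -> float:
    # eps = sqrt((log|H| + log(2/delta)) / (2m)), from delta = 2|H| e^{-2m eps^2}
    return math.sqrt((math.log(h_size) + math.log(2 / delta)) / (2 * m))

print(uniform_eps(1000, m=10000, delta=0.05))  # ~0.023
print(uniform_eps(2000, m=10000, delta=0.05))  # ~0.024: doubling |H| costs little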
The previous theorem stipulates that for a finite hypothesis set $H$, we have
$$R(h) \leq \hat{R}(h) + O\left(\sqrt{\frac{\log_2|H|}{m}}\right).$$
Note that:
- $\log_2|H|$ is the number of bits needed to represent $H$; this points to Occam's principle: a smaller hypothesis space is better;
- a larger sample size $m$ guarantees better generalization;
- for the inconsistent case, a larger sample size is required to obtain the same guarantee as in the consistent case (where $R(h_S) \leq \epsilon$ once $m \geq \frac{1}{\epsilon}(\log|H| + \log\frac{1}{\delta})$), since the dependence is now on $\frac{1}{\epsilon^2}$ rather than $\frac{1}{\epsilon}$; see the comparison sketched below.
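A side-by-side evaluation of the two sample-size requirements (a sketch of my own; $|H| = 10^6$ and $\delta = 0.05$ are arbitrary choices) makes the $\frac{1}{\epsilon}$ versus $\frac{1}{\epsilon^2}$ gap visible:

import math

def m_consistent(h_size, eps, delta):
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

def m_inconsistent(h_size, eps, delta):
    return math.ceil((math.log(h_size) + math.log(2 / delta)) / (2 * eps**2))

for eps in (0.1, 0.01):
    print(eps, m_consistent(10**6, eps, 0.05), m_inconsistent(10**6, eps, 0.05))
# shrinking eps tenfold multiplies the first bound by 10, the second by 100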
Deterministic versus Stochastic Scenario

The Stochastic Scenario
- the distribution $D$ is now defined on $X \times Y$ (in the deterministic scenario it was defined just on $X$);
- the training data is a sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$, where the $(x_i, y_i)$ are i.i.d. random variables;
- the output label $y_i$ is a probabilistic function of the input.

Example
If we try to predict the gender of a person based on weight and height, the result (male or female) is not unique.
Agnostic PAC-Algorithms

Definition
Let $H$ be a hypothesis set. An algorithm $A$ is an agnostic PAC-algorithm if there exists a polynomial function such that for any $\epsilon > 0$ and $\delta > 0$ we have
$$P\left(R(h_S) - \min_{h \in H} R(h) < \epsilon\right) \geq 1 - \delta$$
for every sample of size $m \geq poly\left(\frac{1}{\epsilon}, \frac{1}{\delta}\right)$ and for all probability distributions $D$ over $X \times Y$. If $A$ runs in time polynomial in $\frac{1}{\epsilon}$ and $\frac{1}{\delta}$, then $A$ is an efficient agnostic PAC-algorithm.
Bayes Error and Noise

Definition
Given a distribution $D$ over $X \times Y$, the Bayes error $R^*$ is
$$R^* = \inf\{R(h) : h \text{ is measurable}\}.$$
A hypothesis $h$ such that $R(h) = R^*$ is called a Bayes hypothesis and is denoted by $h_{Bayes}$.
- in the deterministic case, $R^* = 0$;
- in the stochastic case, we may have $R^* \neq 0$;
- using conditional probabilities, the Bayes hypothesis can be defined by
$$(\forall x)\ h_{Bayes}(x) = \operatorname{argmax}_{y \in \{0,1\}} P(y \mid x),$$
which means that the class $y$ is the most probable class a posteriori, that is, after seeing the data $x$;
- the average error made by $h_{Bayes}$ on $x$ is $\min\{P(1 \mid x), P(0 \mid x)\}$.
Definition
Given a distribution $D$, the noise at $x$ is
$$noise(x) = \min\{P(1 \mid x), P(0 \mid x)\}.$$
The average noise is $E(noise(x))$.
The average noise is the Bayes error: $E(noise(x)) = R^*$. The noise indicates the level of difficulty of the learning task; a point $x \in X$ with $noise(x) = 0.5$ is said to be noisy.
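For a toy discrete distribution (numbers invented purely for illustration), the identity $E(noise(x)) = R^*$ can be computed directly:

p_x  = {"a": 0.5, "b": 0.3, "c": 0.2}   # marginal P(x)
p1_x = {"a": 0.9, "b": 0.5, "c": 0.1}   # conditional P(y = 1 | x); "b" is noisy

def noise(x):
    return min(p1_x[x], 1 - p1_x[x])    # error of h_Bayes at x

bayes_error = sum(p_x[x] * noise(x) for x in p_x)
print(bayes_error)  # 0.5*0.1 + 0.3*0.5 + 0.2*0.1 = 0.22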
Estimation and Approximation Errors
- $R(h)$ is the error of hypothesis $h$;
- $R^* = \inf\{R(h) : h \text{ is measurable}\}$ is the Bayes error;
- $h^*$ is the hypothesis in $H$ with minimal error (the best-in-class hypothesis). It always exists when $H$ is finite; if this is not the case, instead of $R(h^*)$ we can use $\inf_{h \in H} R(h)$.
By definition, $R(h) \geq R(h^*) \geq R^*$.
Since $R(h) \geq R(h^*) \geq R^*$, we can define:
- the estimation error $R(h) - R(h^*)$; it depends on the hypothesis $h$ selected;
- the approximation error $R(h^*) - R^*$; it measures how well the Bayes error can be approximated using $H$.
Then
$$R(h) - R^* = (R(h) - R(h^*)) + (R(h^*) - R^*).$$
Empirical Risk Minimization (ERM)

Definition
An algorithm that returns a hypothesis $h_S^{ERM}$ with the smallest empirical error $\hat{R}(h)$ is said to be an ERM algorithm.

We have
$$R(h_S^{ERM}) - R(h^*) = (R(h_S^{ERM}) - \hat{R}(h_S^{ERM})) + (\hat{R}(h_S^{ERM}) - R(h^*))$$
$$\leq (R(h_S^{ERM}) - \hat{R}(h_S^{ERM})) + (\hat{R}(h^*) - R(h^*)) \leq 2 \sup_{h \in H} |\hat{R}(h) - R(h)|,$$
where the first inequality uses $\hat{R}(h_S^{ERM}) \leq \hat{R}(h^*)$, which holds by the definition of ERM.

Note that:
- since $h^*$ is the hypothesis in $H$ with minimal error (the best-in-class hypothesis), $R(h^*)$ decreases as $H$ grows;
- the uniform deviation bound $|R(h) - \hat{R}(h)| \leq \sqrt{\frac{\log|H| + \log\frac{2}{\delta}}{2m}}$ increases with $|H|$.
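A minimal end-to-end sketch of ERM over a finite class (my own toy setup, not from the slides: threshold classifiers $h_t(x) = 1[x \geq t]$, uniform $X$ on $[0,1]$, true labels given by the threshold $0.4$) confirms the $2\sup$ bound on a concrete sample:

import random

thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]          # H = {h_t : h_t(x) = 1[x >= t]}
rng = random.Random(0)
sample = [(x, int(x >= 0.4)) for x in (rng.random() for _ in range(200))]

def emp_risk(t):   # empirical risk of h_t on the sample
    return sum(int(x >= t) != y for x, y in sample) / len(sample)

def true_risk(t):  # exact risk of h_t: the uniform mass between t and 0.4
    return abs(t - 0.4)

t_erm   = min(thresholds, key=emp_risk)          # ERM hypothesis
t_star  = min(thresholds, key=true_risk)         # best-in-class h*
sup_dev = max(abs(emp_risk(t) - true_risk(t)) for t in thresholds)
print(true_risk(t_erm) - true_risk(t_star), "<=", 2 * sup_dev)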