Bayesian learning
Probably Approximately Correct Learning
Peter Antal, antal@mit.bme.hu
A.I., December 1, 2017
Learning paradigms
- Bayesian learning
- Falsification / hypothesis testing approach
- Probably Approximately Correct learning
- Decision-tree/list learning
Epicurus' (341 B.C. - 270 B.C.) principle of multiple explanations states that one should keep all hypotheses that are consistent with the data.
The principle of Occam's razor (William of Ockham, c. 1287-1347): when inferring causes, entities should not be multiplied beyond necessity. This is widely understood to mean: among all hypotheses consistent with the observations, choose the simplest. In terms of a prior distribution over hypotheses, this amounts to giving simpler hypotheses higher a priori probability and more complex ones lower probability.
Russell & Norvig: Artificial Intelligence, ch. 20
[Figure: sequential likelihood of the given data under hypotheses h1-h5, as the number of observations grows from 1 to 12.]
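A minimal sketch of how such sequential likelihood/posterior curves can be reproduced. It assumes the standard Russell & Norvig candy-bag setup (five hypotheses h1-h5 with cherry proportions 1.0, 0.75, 0.5, 0.25, 0.0 and priors 0.1, 0.2, 0.4, 0.2, 0.1); the observed sequence is an illustrative assumption, not data from the slide.

```python
# Sketch: sequential Bayesian updating over five candy-bag hypotheses (assumed setup).
cherry_fraction = {"h1": 1.0, "h2": 0.75, "h3": 0.5, "h4": 0.25, "h5": 0.0}
prior = {"h1": 0.1, "h2": 0.2, "h3": 0.4, "h4": 0.2, "h5": 0.1}

observations = ["lime"] * 10  # illustrative data stream: 10 lime candies in a row

posterior = dict(prior)
for t, candy in enumerate(observations, start=1):
    # Likelihood of this single observation under each hypothesis
    like = {h: (p if candy == "cherry" else 1.0 - p)
            for h, p in cherry_fraction.items()}
    # Bayes rule: new posterior is proportional to likelihood * previous posterior
    unnorm = {h: like[h] * posterior[h] for h in posterior}
    z = sum(unnorm.values())
    posterior = {h: v / z for h, v in unnorm.items()}
    print(t, {h: round(v, 3) for h, v in posterior.items()})
```

With an all-lime stream the posterior mass shifts toward h5, mirroring the behaviour the plot illustrates.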
Probability of summary statistics
Relative frequencies: p̂(Cherry=1 | D_N) = N_{C=1} / N
Estimation error: p̂(Cherry=1 | D_N) vs. the true p(Cherry=1)
Data generation: binomial distribution with parameters N, p
Confidence intervals: directly or using approximations
Asymptotic convergence: law of large numbers
Asymptotic convergence speed: central limit theorem
Convergence bounds for finite(!) data (ε accuracy, δ confidence), sample complexity N(ε,δ):
p(D_N : |p̂_{D_N}(C=1) − p(C=1)| > ε) < δ for N > N(ε,δ)
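The slide does not name a specific finite-sample bound; one concrete choice is the Hoeffding inequality, which gives N(ε,δ) ≥ ln(2/δ) / (2ε²). The sketch below (function names are mine) uses that bound as an illustrative assumption and checks it empirically on simulated data.

```python
import math, random

def hoeffding_sample_size(eps, delta):
    """Smallest N guaranteeing P(|p_hat - p| > eps) < delta via the Hoeffding bound."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# Example: estimate p(Cherry=1) to within eps = 0.05 with confidence 1 - delta = 0.95
eps, delta = 0.05, 0.05
N = hoeffding_sample_size(eps, delta)
print("required N:", N)  # 738 with these settings

# Empirical check: draw N Bernoulli(p) samples and compare the relative frequency to p
p_true = 0.3
random.seed(0)
p_hat = sum(random.random() < p_true for _ in range(N)) / N
print("p_hat:", p_hat, "|error|:", abs(p_hat - p_true))
```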
Terminology:
Null hypothesis (H0): the tested model
Type I error / error of the first kind / α error: p(H0 rejected | H0 holds)
Specificity: p(H0 not rejected | H0 holds) = 1 − α
Significance: α
p-value: probability of observations at least as extreme as the actual one in repeated experiments, assuming H0
Type II error / error of the second kind / β error: p(H0 not rejected | H0 does not hold)
Power or sensitivity: p(H0 rejected | H0 does not hold) = 1 − β

Reported \ Reference      H0 holds                            H0 does not hold
H0 not rejected           correct (1 − α)                     Type II error (β)
H0 rejected               Type I error (α, false rejection)   correct (1 − β, power)
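A small Monte Carlo sketch of these quantities for a toy test (the decision rule, the alternative p = 0.65, and all names are illustrative assumptions, not part of the slide): reject H0: p = 0.5 for a coin if the number of heads in 100 tosses falls outside [40, 60].

```python
import random
random.seed(1)

def reject(heads, lo=40, hi=60):
    """Illustrative decision rule: reject H0 (p = 0.5) if the count falls outside [lo, hi]."""
    return heads < lo or heads > hi

def rejection_rate(p, trials=5000, n=100):
    """Fraction of repeated experiments in which H0 is rejected when the true bias is p."""
    return sum(reject(sum(random.random() < p for _ in range(n)))
               for _ in range(trials)) / trials

alpha = rejection_rate(0.5)   # Type I error rate: rejecting although H0 holds
power = rejection_rate(0.65)  # power = 1 - beta against the alternative p = 0.65
print("estimated alpha:", alpha)
print("estimated power:", power, "-> Type II error beta:", round(1 - power, 3))
```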
Frequentist | Bayesian
- | Prior probabilities
Null hypothesis; indirect: proving by refutation | Model selection: direct; model averaging
Likelihood ratio test | Bayes factor
p-value | -
- | Posterior probabilities
Confidence interval | Credible region
Significance level | Optimal decision based on expected utility
Multiple testing problem | Optimal correction
Regularization | Non-informative prior
The Probably Approximately Correct (PAC) learning framework
A single estimate of the expected error for a given hypothesis is convergent, but can we estimate the errors of all hypotheses uniformly well?
Example from concept learning:
X: i.i.d. samples
n: sample size
H: hypothesis space
Assume that the true hypothesis f is an element of the hypothesis space H.
Define the error of a hypothesis h as its misclassification rate: error(h) = p(h(x) ≠ f(x))
Hypothesis h is approximately correct if error(h) ≤ ε (ε is the accuracy).
Hypothesis h belongs to H_bad if error(h) > ε.
H can thus be partitioned into H_{≤ε} and H_bad.
By definition, for any h ∈ H_bad the probability of error on a single example is larger than ε, so the probability of no error on a single example is less than (1 − ε).
Thus, for n i.i.d. samples and a fixed h_b ∈ H_bad:
p(D_n : h_b(x) = f(x) for all x in D_n) ≤ (1 − ε)^n
The probability that H_bad contains some hypothesis consistent with all n samples can be bounded as
p(D_n : ∃ h_b ∈ H_bad consistent with D_n) ≤ |H_bad| (1 − ε)^n ≤ |H| (1 − ε)^n
To have approximate correctness with probability at least 1 − δ, the failure probability must be at most δ:
|H| (1 − ε)^n ≤ δ
Expressing the sample size as a function of the accuracy ε and the confidence δ gives a bound on the sample complexity:
n ≥ (1/ε)(ln |H| + ln(1/δ))
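A small sketch of this bound as a function (the function name is mine); the illustrative call uses the Boolean decision-tree hypothesis space counted later in these slides, |H| = 2^(2^n) for n = 6 attributes.

```python
import math

def pac_sample_complexity(ln_h, eps, delta):
    """Samples sufficient for PAC learning: n >= (1/eps) * (ln|H| + ln(1/delta)).
    ln_h is ln|H|, passed directly so that huge hypothesis spaces stay tractable."""
    return math.ceil((ln_h + math.log(1.0 / delta)) / eps)

# Illustrative call: all Boolean functions of 6 attributes, |H| = 2^(2^6) = 2^64
ln_h = (2 ** 6) * math.log(2)
print(pac_sample_complexity(ln_h, eps=0.1, delta=0.05))  # 474 examples suffice
```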
Problem: decide whether to wait for a table at a restaurant, based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
Examples described by attribute values (Boolean, discrete, continuous) E.g., situations where I will/won't wait for a table: Classification of examples is positive (T) or negative (F)
One possible representation for hypotheses E.g., here is the true tree for deciding whether to wait:
Decision trees can express any function of the input attributes. E.g., for Boolean functions, each truth table row corresponds to a path to a leaf.
Trivially, there is a consistent decision tree for any training set with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples.
Prefer to find more compact decision trees.
How many distinct decision trees with n Boolean attributes?
= number of Boolean functions
= number of distinct truth tables with 2^n rows
= 2^(2^n)
E.g., with 6 Boolean attributes, there are 2^64 = 18,446,744,073,709,551,616 trees.
How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)?
Each attribute can be in (positive), in (negative), or out, so there are 3^n distinct conjunctive hypotheses.
A more expressive hypothesis space:
- increases the chance that the target function can be expressed
- increases the number of hypotheses consistent with the training set
- may yield worse predictions
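A quick sketch that reproduces these two counts (variable names are mine):

```python
# Count the hypothesis spaces described above for n Boolean attributes.
n = 6
num_decision_trees = 2 ** (2 ** n)   # one Boolean function per truth table over 2^n rows
num_conjunctions = 3 ** n            # each attribute: positive literal, negative literal, or absent
print(num_decision_trees)            # 18446744073709551616
print(num_conjunctions)              # 729
```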
Aim: find a small tree consistent with the training examples Idea: (recursively) choose "most significant" attribute as root of (sub)tree
Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative" Patrons? is a better choice
To implement Choose-Attribute in the DTL algorithm:
Information content (entropy): I(P(v_1), ..., P(v_n)) = Σ_{i=1..n} −P(v_i) log2 P(v_i)
For a training set containing p positive examples and n negative examples:
I(p/(p+n), n/(p+n)) = −(p/(p+n)) log2(p/(p+n)) − (n/(p+n)) log2(n/(p+n))
A chosen attribute A with v distinct values divides the training set E into subsets E_1, ..., E_v according to their values for A.
remainder(A) = Σ_{i=1..v} (p_i + n_i)/(p + n) · I(p_i/(p_i+n_i), n_i/(p_i+n_i))
Information gain (IG), or reduction in entropy from the attribute test:
IG(A) = I(p/(p+n), n/(p+n)) − remainder(A)
Choose the attribute with the largest IG.
For the training set, p = n = 6, I(6/12, 6/12) = 1 bit.
Consider the attributes Patrons and Type (and others too):
IG(Patrons) = 1 − [2/12·I(0,1) + 4/12·I(1,0) + 6/12·I(2/6,4/6)] ≈ 0.541 bits
IG(Type) = 1 − [2/12·I(1/2,1/2) + 2/12·I(1/2,1/2) + 4/12·I(2/4,2/4) + 4/12·I(2/4,2/4)] = 0 bits
Patrons has the highest IG of all attributes and so is chosen by the DTL algorithm as the root.
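A minimal sketch that reproduces these two numbers; the per-value (positive, negative) split counts are taken from the slide, and the helper names are mine.

```python
import math

def entropy(probs):
    """I(P(v1), ..., P(vn)) = sum of -P(vi) * log2 P(vi), treating 0*log 0 as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(parent_pos, parent_neg, splits):
    """IG(A) = I(parent) - remainder(A), where splits is a list of (pos, neg) counts."""
    total = parent_pos + parent_neg
    parent = entropy([parent_pos / total, parent_neg / total])
    remainder = sum((p + n) / total * entropy([p / (p + n), n / (p + n)])
                    for p, n in splits)
    return parent - remainder

# Split counts for the 12 restaurant examples (from the slide above)
print(information_gain(6, 6, [(0, 2), (4, 0), (2, 4)]))          # Patrons: ~0.541 bits
print(information_gain(6, 6, [(1, 1), (1, 1), (2, 2), (2, 2)]))  # Type: 0.0 bits
```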
Decision tree learned from the 12 examples: substantially simpler than the true tree; a more complex hypothesis isn't justified by such a small amount of data.
Total error
In practice, the target is typically not inside the hypothesis space: the total real error can be decomposed into bias + variance.
bias: expected error / modelling error
variance: estimation error / model selection error
For a given sample size, the total error decomposes into modeling error plus statistical (model selection) error; as model complexity grows, the modeling error decreases while the statistical error increases.
[Figure: modeling error, statistical (model selection) error, and total error as a function of model complexity.]
Sequential tests of at most k literals over n attributes: decision lists k-DL(n)
Number of possible tests (conjunctions of at most k literals):
|Conj(n, k)| = Σ_{i=0..k} C(2n, i) = O(n^k)
Number of decision lists (test sequences with outcomes):
|k-DL(n)| ≤ 3^{|Conj(n,k)|} · |Conj(n,k)|!
Number of decision lists:
|k-DL(n)| = 2^{O(n^k log2(n^k))}
PAC sample complexity:
m ≥ (1/ε)(ln(1/δ) + O(n^k log2(n^k)))
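A sketch combining the exact count of |Conj(n, k)| with the generic PAC bound from earlier, using ln|k-DL(n)| ≤ |Conj| · ln 3 + ln(|Conj|!). The function names and the illustrative parameters (n = 10 attributes, k = 2) are mine.

```python
import math

def conj(n, k):
    """|Conj(n, k)| = sum_{i=0..k} C(2n, i): conjunctions of at most k literals over n attributes."""
    return sum(math.comb(2 * n, i) for i in range(k + 1))

def kdl_sample_bound(n, k, eps, delta):
    """PAC bound m >= (1/eps) * (ln(1/delta) + ln|k-DL(n)|),
    using ln|k-DL(n)| <= |Conj| * ln 3 + ln(|Conj|!)."""
    c = conj(n, k)
    ln_h = c * math.log(3) + math.lgamma(c + 1)  # lgamma(c + 1) = ln(c!)
    return math.ceil((math.log(1.0 / delta) + ln_h) / eps)

# Illustrative numbers: 10 Boolean attributes, tests of at most 2 literals
print(conj(10, 2))                                   # 211 possible tests
print(kdl_sample_bound(10, 2, eps=0.1, delta=0.05))  # sufficient sample size under the bound
```

The point of the bound is that the exponent grows only polynomially in n (for fixed k), so decision lists with short tests are PAC-learnable from a reasonable number of examples.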