Bayesian learning Probably Approximately Correct Learning


Bayesian learning, Probably Approximately Correct learning. Peter Antal, antal@mit.bme.hu. A.I., December 1, 2017.

Learning paradigms: Bayesian learning; falsification / hypothesis-testing approach; Probably Approximately Correct learning; decision-tree/list learning.

Epicurus' (342? B.C. - 270 B.C.) principle of multiple explanations states that one should keep all hypotheses that are consistent with the data. The principle of Occam's razor, named after William of Ockham (c. 1285-1349; the name is sometimes spelt Occam), states that when inferring causes, entities should not be multiplied beyond necessity. This is widely understood to mean: among all hypotheses consistent with the observations, choose the simplest. In terms of a prior distribution over hypotheses, this is the same as giving simpler hypotheses higher a priori probability, and more complex ones lower probability.
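
As a minimal, hypothetical illustration of that last sentence (the hypothesis names and description lengths below are invented for the example, not taken from the slides), a prior that decays exponentially with description length gives exactly this kind of preference for simplicity:

```python
# A sketch of a simplicity-favouring prior: probability decays exponentially
# with the (illustrative, assumed) description length of each hypothesis.
hypotheses = {            # hypothesis name -> description length in bits (assumed values)
    "h_simple": 3,
    "h_medium": 7,
    "h_complex": 15,
}

unnormalized = {h: 2.0 ** -bits for h, bits in hypotheses.items()}
z = sum(unnormalized.values())
prior = {h: w / z for h, w in unnormalized.items()}
print(prior)   # h_simple receives the largest prior probability, h_complex the smallest
```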


Figures from Russell & Norvig: Artificial Intelligence: A Modern Approach, ch. 20.

[Plot: sequential likelihood of the given data under hypotheses h1-h5, shown over 1-12 observed samples; y-axis from 0 to 1.]
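
The plot presumably tracks the candy-bag example from Russell & Norvig, ch. 20. Below is a minimal sketch of that kind of sequential Bayesian update; the five lime fractions and the prior (0.1, 0.2, 0.4, 0.2, 0.1) are the textbook values and are assumed here rather than read off the slide:

```python
# Sequential Bayesian updating over five hypotheses about the fraction of
# lime candies in a bag, with i.i.d. candy observations.
lime_fraction = [0.0, 0.25, 0.5, 0.75, 1.0]   # h1..h5 (assumed textbook values)
prior         = [0.1, 0.2, 0.4, 0.2, 0.1]

def posteriors(observations):
    """Return P(h_i | observations) for a sequence of 'lime'/'cherry' draws."""
    post = list(prior)                    # unnormalized posterior = prior * likelihood
    for obs in observations:
        for i, p_lime in enumerate(lime_fraction):
            post[i] *= p_lime if obs == "lime" else (1.0 - p_lime)
    z = sum(post)
    return [p / z for p in post]

post = posteriors(["lime"] * 5)           # after observing 5 lime candies in a row
print([round(p, 3) for p in post])        # posterior mass shifts toward the all-lime bag

# Bayesian prediction for the next candy: average over hypotheses.
p_next_lime = sum(p * f for p, f in zip(post, lime_fraction))
print(round(p_next_lime, 3))
```

Running it after a run of lime observations shows the posterior mass concentrating on the all-lime hypothesis, which is the qualitative behaviour the plot illustrates.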

Probability of summary statistics.
Data generation: binomial distribution with parameters n and p.
Relative frequency estimate: p̂(Cherry=1 | D_N) = N_{C=1} / N.
Estimation error: confidence intervals, directly or using approximations.
Asymptotic convergence: law of large numbers.
Asymptotic convergence speed: central limit theorem.
Convergence bounds for finite(!) data, with accuracy ε and confidence δ, give the sample complexity N_{ε,δ}: p(D_N : |p̂_{D_N}(Cherry=1) - p(Cherry=1)| > ε) ≤ δ.
[Plot: relative frequencies p̂(Cherry=1 | D_N) for h2-h4 over 1-12 samples; y-axis from 0 to 0.3.]
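
The slide does not name a specific finite-sample bound; as one standard, hedged choice, Hoeffding's inequality gives an explicit sample size for a given accuracy ε and confidence δ:

```python
import math

def hoeffding_sample_size(eps, delta):
    """Smallest N with P(|p_hat - p| > eps) <= delta, via Hoeffding's bound
    P(|p_hat - p| > eps) <= 2 * exp(-2 * N * eps**2)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# Estimate p(Cherry=1) to within eps = 0.05 with confidence 1 - delta = 0.95.
print(hoeffding_sample_size(0.05, 0.05))   # 738 samples suffice
```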

Terminology:
Null hypothesis (H0): the tested model.
Type I error / error of the first kind / α error: p(H0 rejected | H0 holds).
Specificity: p(H0 not rejected | H0 holds) = 1 - α.
Significance: α.
p-value: probability of observations at least as extreme as the actual one in repeated experiments under H0.
Type II error / error of the second kind / β error: p(H0 not rejected | H0 does not hold).
Power or sensitivity: p(H0 rejected | H0 does not hold) = 1 - β.
Reported vs. reference: rejecting H0 when H0 holds is a Type I error ("false rejection"); not rejecting H0 when H0 does not hold is a Type II error.
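
As a hypothetical illustration (the coin-test setup below is not from the slides), a small simulation can estimate α and the power 1 - β for a one-sided test that rejects H0: p = 0.5 when too many heads are observed:

```python
import random

def rejection_rate(p_true, n, threshold, trials=20000):
    """Monte Carlo estimate of P(reject H0) when the true head probability is p_true.
    The test rejects H0: p = 0.5 whenever more than `threshold` heads appear in n tosses."""
    rejects = 0
    for _ in range(trials):
        heads = sum(random.random() < p_true for _ in range(n))
        rejects += heads > threshold
    return rejects / trials

random.seed(0)
n, threshold = 100, 58
alpha = rejection_rate(0.5, n, threshold)   # Type I error rate (H0 true)
power = rejection_rate(0.6, n, threshold)   # power against the alternative p = 0.6
print(f"alpha ~ {alpha:.3f}, power ~ {power:.3f}, beta ~ {1 - power:.3f}")
```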

Frequentist | Bayesian
(no prior) | Prior probabilities
Null hypothesis; indirect: proving by refutation | Model selection: direct; model averaging
Likelihood ratio test | Bayes factor
p-value | (no counterpart)
(no counterpart) | Posterior probabilities
Confidence interval | Credible region
Significance level | Optimal decision based on expected utility
Multiple testing problem, optimal correction | Regularization; non-informative prior

Probably Approximately Correct (PAC) learning. A single estimate of the expected error for a given hypothesis is convergent, but can we estimate the errors for all hypotheses uniformly well? Example from concept learning. X: i.i.d. samples; n: sample size; H: hypothesis space; H_bad: the set of seriously erroneous ("bad") hypotheses, defined below.

Assume that the true hypothesis f is an element of the hypothesis space H. Define the error of a hypothesis h as its misclassification rate: error(h) = p(h(x) ≠ f(x)). Hypothesis h is approximately correct if error(h) ≤ ε (ε is the accuracy). For h ∈ H_bad: error(h) > ε.

H can be separated into H_{≤ε} and H_bad: H = H_{≤ε} ∪ H_bad. By definition, for any h ∈ H_bad the probability of error on a single sample is larger than ε, thus the probability of no error on a single sample is less than (1 - ε).

Thus, for n samples and a fixed h_b ∈ H_bad: p(D_n : h_b(x) = f(x) on all of D_n) ≤ (1 - ε)^n. For the whole of H_bad this can be bounded (union bound) as p(D_n : ∃ h_b ∈ H_bad with h_b(x) = f(x) on all of D_n) ≤ |H_bad| (1 - ε)^n ≤ |H| (1 - ε)^n.

To have at least 1 - δ probability of approximate correctness, require |H| (1 - ε)^n ≤ δ. Expressing the sample size as a function of the accuracy ε and the confidence δ, we get a bound on the sample complexity: n ≥ (1/ε)(ln|H| + ln(1/δ)).
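
A minimal sketch of the bound just derived; the plug-in value |H| = 2^(2^6), all Boolean functions of 6 attributes from the decision-tree slides below, is only an illustrative choice:

```python
import math

def pac_sample_size(h_size, eps, delta):
    """Number of i.i.d. examples sufficient so that, with probability >= 1 - delta,
    every hypothesis consistent with the data has error <= eps:
    n >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

# |H| = 2^(2^6): all Boolean functions of 6 attributes, eps = delta = 0.05.
print(pac_sample_size(2 ** (2 ** 6), 0.05, 0.05))   # roughly 950 examples
```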

Problem: decide whether to wait for a table at a restaurant, based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)

Examples are described by attribute values (Boolean, discrete, continuous), e.g. situations where I will/won't wait for a table (a set of 12 example situations). Classification of examples is positive (T) or negative (F).

One possible representation for hypotheses is a decision tree, e.g. the true tree for deciding whether to wait. [Decision-tree figure not reproduced.]

Decision trees can express any function of the input attributes. E.g., for Boolean functions, each truth table row corresponds to a path to a leaf. Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples. Prefer to find more compact decision trees.

How many distinct decision trees are there with n Boolean attributes? = number of Boolean functions = number of distinct truth tables with 2^n rows = 2^(2^n). E.g., with 6 Boolean attributes there are 2^64 = 18,446,744,073,709,551,616 trees. How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)? Each attribute can be in (positive), in (negative), or out, giving 3^n distinct conjunctive hypotheses. A more expressive hypothesis space increases the chance that the target function can be expressed, but also increases the number of hypotheses consistent with the training set, so it may yield worse predictions.
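
A quick sketch that checks these counts (the function names are mine):

```python
def n_decision_trees(n_attrs):
    """Number of Boolean functions of n attributes, i.e. 2^(2^n) distinct truth tables."""
    return 2 ** (2 ** n_attrs)

def n_conjunctions(n_attrs):
    """Each attribute is positive, negated, or absent: 3^n conjunctive hypotheses."""
    return 3 ** n_attrs

print(n_decision_trees(6))   # 18446744073709551616
print(n_conjunctions(6))     # 729
```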

Aim: find a small tree consistent with the training examples. Idea: (recursively) choose the "most significant" attribute as the root of the (sub)tree.

Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative". Comparing the candidate splits, Patrons? is a better choice.

To implement Choose-Attribute in the DTL algorithm, use the information content (entropy): I(P(v_1), ..., P(v_n)) = Σ_{i=1..n} -P(v_i) log2 P(v_i). For a training set containing p positive and n negative examples: I(p/(p+n), n/(p+n)) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n)).

A chosen attribute A divides the training set E into subsets E_1, ..., E_v according to their values for A, where A has v distinct values: remainder(A) = Σ_{i=1..v} ((p_i + n_i)/(p + n)) I(p_i/(p_i + n_i), n_i/(p_i + n_i)). The information gain (IG), or reduction in entropy from the attribute test, is IG(A) = I(p/(p+n), n/(p+n)) - remainder(A). Choose the attribute with the largest IG.

For the training set, p = n = 6, so I(6/12, 6/12) = 1 bit. Consider the attributes Patrons and Type (and others too): IG(Patrons) = 1 - [(2/12) I(0,1) + (4/12) I(1,0) + (6/12) I(2/6, 4/6)] ≈ 0.541 bits; IG(Type) = 1 - [(2/12) I(1/2,1/2) + (2/12) I(1/2,1/2) + (4/12) I(2/4,2/4) + (4/12) I(2/4,2/4)] = 0 bits. Patrons has the highest IG of all attributes and so is chosen by the DTL algorithm as the root.
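
A small sketch reproducing these two numbers; the per-value positive/negative counts are the ones implied by the formulas above:

```python
import math

def entropy2(p, n):
    """Entropy I(p/(p+n), n/(p+n)) in bits of a Boolean-classified example set."""
    total = p + n
    h = 0.0
    for c in (p, n):
        if c:  # 0 * log(0) is taken as 0
            h -= (c / total) * math.log2(c / total)
    return h

def information_gain(splits, p, n):
    """IG(A) = I(p/(p+n), n/(p+n)) - sum_i (p_i+n_i)/(p+n) * I(...) over the splits."""
    remainder = sum((pi + ni) / (p + n) * entropy2(pi, ni) for pi, ni in splits)
    return entropy2(p, n) - remainder

# Restaurant training set: 6 positive, 6 negative examples.
# Patrons: None -> (0+, 2-), Some -> (4+, 0-), Full -> (2+, 4-)
print(round(information_gain([(0, 2), (4, 0), (2, 4)], 6, 6), 3))  # ~0.541
# Type: French (1+, 1-), Italian (1+, 1-), Thai (2+, 2-), Burger (2+, 2-)
print(round(information_gain([(1, 1), (1, 1), (2, 2), (2, 2)], 6, 6), 3))  # 0.0
```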

Decision tree learned from the 12 examples: substantially simpler than the true tree; a more complex hypothesis isn't justified by the small amount of data.

Total error. In practice, the target typically is not inside the hypothesis space: the total real error can be decomposed into bias + variance. Bias: expected/modelling error. Variance: estimation/empirical (model selection) error. For a given sample size, as model complexity grows the modelling error decreases while the statistical (model selection) error increases; the total error is their sum. [Plot: modeling error, statistical error and total error vs. model complexity.]

Decision lists: sequences of tests, each a conjunction of at most k literals over n attributes: k-DL(n). Number of tests: Conj(n, k) = Σ_{i=0..k} C(2n, i) = O(n^k). Number of test sequences: Conj(n, k)!. Number of decision lists: |k-DL(n)| ≤ 3^{Conj(n,k)} Conj(n, k)!.

Number of decision lists: |k-DL(n)| = 2^{O(n^k log2(n^k))}. PAC sample complexity: m ≥ (1/ε)(ln(1/δ) + O(n^k log2(n^k))).
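
A sketch turning the asymptotic bound into a concrete number by using the explicit count |k-DL(n)| ≤ 3^{Conj(n,k)} Conj(n,k)! from the previous slide; the example values n = 10, k = 2, ε = δ = 0.05 are my own choices:

```python
import math
from math import comb

def conj(n, k):
    """Conj(n, k): number of conjunctions of at most k literals over n attributes."""
    return sum(comb(2 * n, i) for i in range(k + 1))

def k_dl_log_size(n, k):
    """ln |k-DL(n)| using the bound |k-DL(n)| <= 3^Conj(n,k) * Conj(n,k)!."""
    c = conj(n, k)
    return c * math.log(3) + math.lgamma(c + 1)   # lgamma(c + 1) = ln(c!)

def pac_sample_size_kdl(n, k, eps, delta):
    """m >= (1/eps) * (ln(1/delta) + ln|k-DL(n)|) examples suffice."""
    return math.ceil((math.log(1.0 / delta) + k_dl_log_size(n, k)) / eps)

# e.g. 10 attributes (as in the restaurant problem), tests of at most 2 literals.
print(conj(10, 2))                            # 211 possible tests
print(pac_sample_size_kdl(10, 2, 0.05, 0.05))
```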