From inductive inference to machine learning

Adapted from the AIMA slides (Russell & Norvig, Artificial Intelligence: A Modern Approach).

Outline
- Bayesian inference with multiple models
- Bayesian learning
- Maximum a posteriori and maximum likelihood learning
- Bayes net learning: ML parameter learning with complete data; linear regression
- Inductive learning
- Decision tree learning
- Measuring learning performance

Bayesian inference with multiple models

Assume multiple models $M_i = (S_i, \theta_i)$ with priors $p(M_i)$, $i = 1, \ldots, M$. The inference $p(Q = q \mid E = e)$ can be performed as follows:
$$p(q \mid e) = \sum_{i=1}^{M} p(q, M_i \mid e) = \sum_{i=1}^{M} p(q \mid M_i, e)\, p(M_i \mid e)$$
Note that $p(M_i \mid e)$ is a posterior over models given evidence $e$:
$$p(M_i \mid e) = \frac{p(e \mid M_i)\, p(M_i)}{p(e)} \propto p(e \mid M_i)\, p(M_i)$$
i.e., the evidence $e$ reweights our beliefs in the multiple models. Inference performed this way is called Bayesian model averaging (BMA). Compare Epicurus' principle of multiple explanations (c. 341-270 B.C.), which states that one should keep all hypotheses that are consistent with the data.

Bayesian model averaging with data

Besides the models, assume $N$ complete observations $D_N$. The standard inference $p(Q = q \mid E = e, D_N)$ is defined as:
$$p(q \mid e, D_N) = \sum_{i=1}^{M} p(q, M_i \mid e, D_N) = \sum_{i=1}^{M} p(q \mid M_i, e, D_N)\, p(M_i \mid e, D_N)$$
Because $p(q \mid M_i, e, D_N) = p(q \mid M_i, e)$ and $p(M_i \mid e, D_N) \approx p(M_i \mid D_N)$:
$$p(q \mid e, D_N) \approx \sum_{i=1}^{M} p(q \mid M_i, e)\, p(M_i \mid D_N)$$
where again $p(M_i \mid D_N)$ is a posterior after the observations $D_N$:
$$p(M_i \mid D_N) = \frac{p(D_N \mid M_i)\, p(M_i)}{p(D_N)} \propto \underbrace{p(D_N \mid M_i)}_{\text{likelihood}}\; \underbrace{p(M_i)}_{\text{prior}}$$
i.e., our rational foundation, probability theory, automatically includes, and normatively defines, learning from observations as standard Bayesian inference!
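
These two steps translate directly into code: compute the posterior over models from the data likelihoods and priors, then form the posterior-weighted average of the per-model predictions. Below is a minimal Python sketch; the numbers and variable names are illustrative, not from the slides.

# Bayesian model averaging (BMA), a minimal sketch.
# Given M candidate models with priors p(M_i), data likelihoods p(D_N | M_i),
# and per-model predictions p(q | M_i, e), the BMA prediction is the
# posterior-weighted average  sum_i p(q | M_i, e) * p(M_i | D_N).

prior      = [0.5, 0.3, 0.2]      # p(M_i), illustrative numbers
likelihood = [0.02, 0.10, 0.05]   # p(D_N | M_i)
pred_q     = [0.9, 0.6, 0.1]      # p(q | M_i, e)

# Posterior over models: p(M_i | D_N) is proportional to p(D_N | M_i) p(M_i)
unnorm = [lik * pri for lik, pri in zip(likelihood, prior)]
z = sum(unnorm)
posterior = [u / z for u in unnorm]

# BMA predictive: p(q | e, D_N) = sum_i p(q | M_i, e) p(M_i | D_N)
p_q = sum(pq * post for pq, post in zip(pred_q, posterior))
print(posterior, p_q)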

Full Bayesian learning

View learning as Bayesian updating of a probability distribution over the hypothesis space: $H$ is the hypothesis variable with values $h_1, h_2, \ldots$ and prior $\mathbf{P}(H)$; the $j$th observation $d_j$ gives the outcome of the random variable $D_j$; the training data are $\mathbf{d} = d_1, \ldots, d_N$.

Given the data so far, each hypothesis has a posterior probability:
$$P(h_i \mid \mathbf{d}) = \alpha\, P(\mathbf{d} \mid h_i)\, P(h_i)$$
where $P(\mathbf{d} \mid h_i)$ is called the likelihood.

Predictions use a likelihood-weighted average over the hypotheses:
$$\mathbf{P}(X \mid \mathbf{d}) = \sum_i \mathbf{P}(X \mid \mathbf{d}, h_i)\, P(h_i \mid \mathbf{d}) = \sum_i \mathbf{P}(X \mid h_i)\, P(h_i \mid \mathbf{d})$$
No need to pick one best-guess hypothesis!

Example

Suppose there are five kinds of bags of candies:
- 10% are $h_1$: 100% cherry candies
- 20% are $h_2$: 75% cherry candies + 25% lime candies
- 40% are $h_3$: 50% cherry candies + 50% lime candies
- 20% are $h_4$: 25% cherry candies + 75% lime candies
- 10% are $h_5$: 100% lime candies

Then we observe candies drawn from some bag: what kind of bag is it? What flavour will the next candy be?
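
The next two slides plot exactly this Bayesian update for a run of lime observations. The plots can be reproduced with a few lines of Python (a sketch, not part of the original slides):

# Full Bayesian learning on the candy-bag example: update P(h_i | d) as lime
# candies are observed, and predict P(next candy is lime | d) by averaging
# over the hypotheses.

priors = [0.1, 0.2, 0.4, 0.2, 0.1]       # P(h_1), ..., P(h_5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]     # P(lime | h_i)

posterior = priors[:]
for n in range(1, 11):                   # observe 10 lime candies in a row
    unnorm = [post * pl for post, pl in zip(posterior, p_lime)]
    z = sum(unnorm)
    posterior = [u / z for u in unnorm]
    pred_lime = sum(pl * post for pl, post in zip(p_lime, posterior))
    print(f"after {n} limes: posterior = {[round(p, 3) for p in posterior]}, "
          f"P(next is lime) = {pred_lime:.3f}")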

Posterior probability of hypotheses

[Figure: posterior probabilities $P(h_1 \mid \mathbf{d}), \ldots, P(h_5 \mid \mathbf{d})$ plotted against the number of samples in $\mathbf{d}$, from 0 to 10, for a sequence of lime observations.]

Prediction probability

[Figure: $P(\text{next candy is lime} \mid \mathbf{d})$ plotted against the number of samples in $\mathbf{d}$, from 0 to 10, rising from 0.5 towards 1.]

MAP approximation

Summing over the hypothesis space is often intractable (e.g., there are 18,446,744,073,709,551,616 Boolean functions of 6 attributes).

Maximum a posteriori (MAP) learning: choose $h_{\mathrm{MAP}}$ maximizing $P(h_i \mid \mathbf{d})$, i.e., maximize $P(\mathbf{d} \mid h_i)\, P(h_i)$ or, equivalently, $\log P(\mathbf{d} \mid h_i) + \log P(h_i)$. The log terms can be viewed as (the negative of) the number of bits needed to encode the data given the hypothesis plus the bits needed to encode the hypothesis; this is the basic idea of minimum description length (MDL) learning.

For deterministic hypotheses, $P(\mathbf{d} \mid h_i)$ is 1 if consistent and 0 otherwise, so MAP = simplest consistent hypothesis (cf. science).
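
Continuing the candy sketch above, MAP learning picks the single hypothesis with the highest (log) posterior score and predicts with it alone, instead of averaging. A minimal illustration (not from the slides), after three lime observations:

import math

# MAP learning on the candy-bag example: pick the hypothesis maximizing
# log P(d | h_i) + log P(h_i), then predict with that single hypothesis.

priors = [0.1, 0.2, 0.4, 0.2, 0.1]    # P(h_i)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]  # P(lime | h_i)
n_limes = 3                           # observed data: three limes

def log_score(i):
    if p_lime[i] == 0.0:              # P(d | h_i) = 0: hypothesis ruled out
        return float("-inf")
    return n_limes * math.log(p_lime[i]) + math.log(priors[i])

h_map = max(range(5), key=log_score)
print("h_MAP = h%d, which predicts P(lime) = %.2f" % (h_map + 1, p_lime[h_map]))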

ML approximation

For large data sets, the prior becomes irrelevant. Maximum likelihood (ML) learning: choose $h_{\mathrm{ML}}$ maximizing $P(\mathbf{d} \mid h_i)$, i.e., simply get the best fit to the data; identical to MAP for a uniform prior (which is reasonable if all hypotheses are of the same complexity). ML is the standard (non-Bayesian) statistical learning method.

ML parameter learning in Bayes nets

Bag from a new manufacturer; what fraction $\theta$ of cherry candies? Any $\theta$ is possible: a continuum of hypotheses $h_\theta$, where $\theta$ is the parameter of this simple (binomial) model, a Bayes net with a single node Flavor and $P(F = \mathit{cherry}) = \theta$.

Suppose we unwrap $N$ candies, $c$ cherries and $\ell = N - c$ limes. These are i.i.d. (independent, identically distributed) observations, so
$$P(\mathbf{d} \mid h_\theta) = \prod_{j=1}^{N} P(d_j \mid h_\theta) = \theta^c (1-\theta)^\ell$$
Maximize this with respect to $\theta$, which is easier for the log-likelihood:
$$L(\mathbf{d} \mid h_\theta) = \log P(\mathbf{d} \mid h_\theta) = \sum_{j=1}^{N} \log P(d_j \mid h_\theta) = c \log \theta + \ell \log(1-\theta)$$
$$\frac{dL(\mathbf{d} \mid h_\theta)}{d\theta} = \frac{c}{\theta} - \frac{\ell}{1-\theta} = 0 \quad\Rightarrow\quad \theta = \frac{c}{c+\ell} = \frac{c}{N}$$
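
A quick numerical sanity check of the closed-form result (illustrative, not from the slides): maximize the log-likelihood over a grid of $\theta$ values and compare with $c/N$.

import math

# Numerical check of the ML estimate for the binomial candy model: with c
# cherries and lim limes, the log-likelihood c*log(theta) + lim*log(1-theta)
# is maximized at theta = c / (c + lim).

c, lim = 7, 3                                  # illustrative counts

def loglik(theta):
    return c * math.log(theta) + lim * math.log(1 - theta)

grid = [i / 1000 for i in range(1, 1000)]      # theta in (0, 1)
theta_hat = max(grid, key=loglik)
print(theta_hat, c / (c + lim))                # both equal 0.7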

Inductive learning (a.k.a. Science)

Simplest form: learn a function from examples (tabula rasa). $f$ is the target function; an example is a pair $(x, f(x))$, e.g., a tic-tac-toe board position labelled $+1$.

Problem: find a hypothesis $h$ such that $h \approx f$, given a training set of examples.

(This is a highly simplified model of real learning: it ignores prior knowledge, assumes a deterministic, observable environment, assumes the examples are given, and assumes that the agent wants to learn $f$. Why?)

Inductive learning method

Construct/adjust $h$ to agree with $f$ on the training set ($h$ is consistent if it agrees with $f$ on all examples). E.g., curve fitting:

[Figure: a sequence of plots of $f(x)$ against $x$, fitting the same data points with increasingly complex hypotheses, from a straight line to a curve passing through every point.]

Ockham's razor

Ockham's razor: balance consistency and simplicity. The principle is due to William of Ockham (1285-1349, sometimes spelt Occam) and states that, when inferring causes, entities should not be multiplied beyond necessity. This is widely understood to mean: among all hypotheses consistent with the observations, choose the simplest. In terms of a prior distribution over hypotheses, this is the same as giving simpler hypotheses higher a priori probability and more complex ones lower probability.

Attribute-based representations

Examples are described by attribute values (Boolean, discrete, continuous, etc.), e.g., situations where I will/won't wait for a table:

Example | Alt | Bar | Fri | Hun | Pat  | Price | Rain | Res | Type    | Est   | WillWait
X1      | T   | F   | F   | T   | Some | $$$   | F    | T   | French  | 0-10  | T
X2      | T   | F   | F   | T   | Full | $     | F    | F   | Thai    | 30-60 | F
X3      | F   | T   | F   | F   | Some | $     | F    | F   | Burger  | 0-10  | T
X4      | T   | F   | T   | T   | Full | $     | F    | F   | Thai    | 10-30 | T
X5      | T   | F   | T   | F   | Full | $$$   | F    | T   | French  | >60   | F
X6      | F   | T   | F   | T   | Some | $$    | T    | T   | Italian | 0-10  | T
X7      | F   | T   | F   | F   | None | $     | T    | F   | Burger  | 0-10  | F
X8      | F   | F   | F   | T   | Some | $$    | T    | T   | Thai    | 0-10  | T
X9      | F   | T   | T   | F   | Full | $     | T    | F   | Burger  | >60   | F
X10     | T   | T   | T   | T   | Full | $$$   | F    | T   | Italian | 10-30 | F
X11     | F   | F   | F   | F   | None | $     | F    | F   | Thai    | 0-10  | F
X12     | T   | T   | T   | T   | Full | $     | F    | F   | Burger  | 30-60 | T

Classification of examples is positive (T) or negative (F), given by the target attribute WillWait.

Decision trees

A common representation for protocols; e.g., here is the "true" tree for deciding whether to wait:

Patrons?
  None: F
  Some: T
  Full: WaitEstimate?
    >60:   F
    30-60: Alternate?
             No:  Reservation?
                    No:  Bar?  (No: F, Yes: T)
                    Yes: T
             Yes: Fri/Sat?  (No: F, Yes: T)
    10-30: Hungry?
             No:  T
             Yes: Alternate?
                    No:  T
                    Yes: Raining?  (No: F, Yes: T)
    0-10:  T
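
Read from the root down, such a tree is just a nest of conditionals over an example's attribute values. A minimal Python rendering of the tree above (attribute names follow the table; the dict-based input format is an assumption for illustration):

# The "true" restaurant tree encoded as nested conditionals. `x` is a dict of
# attribute values using the names from the table above.

def will_wait(x):
    if x["Pat"] == "None":
        return False
    if x["Pat"] == "Some":
        return True
    # Pat == "Full"
    if x["Est"] == ">60":
        return False
    if x["Est"] == "30-60":
        if x["Alt"]:
            return x["Fri"]                   # Fri/Sat?
        return x["Res"] or x["Bar"]           # Reservation?, else Bar?
    if x["Est"] == "10-30":
        if not x["Hun"]:
            return True
        return (not x["Alt"]) or x["Rain"]    # Alternate?, then Raining?
    return True                               # Est == "0-10"

print(will_wait({"Pat": "Full", "Est": "10-30", "Hun": True,
                 "Alt": True, "Rain": False}))   # False for this example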

Expressiveness

Decision trees can express any function of the input attributes. E.g., for Boolean functions, truth table row = path to leaf:

A | B | A xor B
F | F | F
F | T | T
T | F | T
T | T | F

A?
  F: B?  (F: F, T: T)
  T: B?  (F: T, T: F)

Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example (unless $f$ is nondeterministic in $x$), but it probably won't generalize to new examples. Prefer to find more compact decision trees.
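
In the extreme, a trivially consistent "tree" is just a lookup table with one path (here, one dictionary key) per training example; the sketch below (not from the slides) shows why it cannot generalize:

# A trivially consistent "decision tree": one path (dictionary entry) per
# training example. It fits the training set perfectly but says nothing
# about inputs it has not seen.

train = {(False, False): False, (False, True): True, (True, False): True}

def lookup_tree(a, b):
    return train.get((a, b), "no leaf for this input")

print(lookup_tree(False, True))   # True (memorized)
print(lookup_tree(True, True))    # "no leaf for this input" (cannot generalize)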

Hypothesis spaces

How many distinct decision trees with $n$ Boolean attributes?
= number of Boolean functions
= number of distinct truth tables with $2^n$ rows
= $2^{2^n}$
E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees.

How many purely conjunctive hypotheses (e.g., $\mathit{Hungry} \wedge \lnot \mathit{Rain}$)?
Each attribute can be in (positive), in (negative), or out, so there are $3^n$ distinct conjunctive hypotheses.

A more expressive hypothesis space
- increases the chance that the target function can be expressed, but
- increases the number of hypotheses consistent with the training set, so it may give worse predictions.
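
A two-line check of these counts (illustrative, not from the slides):

# Counting hypothesis spaces over n Boolean attributes:
# decision trees / Boolean functions: 2**(2**n); pure conjunctions: 3**n.

for n in range(1, 7):
    print(n, 2 ** (2 ** n), 3 ** n)
# n = 6 gives 18446744073709551616 Boolean functions and 729 conjunctions.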

Decision tree learning

Aim: find a small tree consistent with the training examples.
Idea: recursively choose the most significant attribute to branch on.

function DTL(examples, attributes, default) returns a decision tree
  if examples is empty then return default
  else if all examples have the same classification then return the classification
  else if attributes is empty then return Mode(examples)
  else
    best <- Choose-Attribute(attributes, examples)
    tree <- a new decision tree with root test best
    for each value v_i of best do
      examples_i <- {elements of examples with best = v_i}
      subtree <- DTL(examples_i, attributes - best, Mode(examples))
      add a branch to tree with label v_i and subtree subtree
    return tree
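
A compact Python rendering of this pseudocode (a sketch, not the original AIMA code): examples are (attribute-dict, label) pairs, Mode returns the majority label, and Choose-Attribute picks the attribute with maximum information gain, as defined on the slides below. Branches are created only for attribute values that occur in the examples.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mode(examples):
    return Counter(label for _, label in examples).most_common(1)[0][0]

def choose_attribute(attributes, examples):
    def remainder(a):
        values = {x[a] for x, _ in examples}
        total = 0.0
        for v in values:
            subset = [label for x, label in examples if x[a] == v]
            total += len(subset) / len(examples) * entropy(subset)
        return total
    return min(attributes, key=remainder)   # min remainder = max information gain

def dtl(examples, attributes, default):
    if not examples:
        return default
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return mode(examples)
    best = choose_attribute(attributes, examples)
    tree = {best: {}}
    for v in {x[best] for x, _ in examples}:
        exs_v = [(x, label) for x, label in examples if x[best] == v]
        tree[best][v] = dtl(exs_v, [a for a in attributes if a != best],
                            mode(examples))
    return tree

# Tiny illustrative training set (made up, not the 12 restaurant examples).
data = [({"Pat": "Some", "Hun": True},  True),
        ({"Pat": "None", "Hun": True},  False),
        ({"Pat": "Full", "Hun": False}, False),
        ({"Pat": "Full", "Hun": True},  True)]
print(dtl(data, ["Pat", "Hun"], default=False))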

Choosing an attribute

Idea: a good attribute splits the examples into subsets that are (ideally) all positive or all negative.

[Figure: the 12 examples split by Patrons? (None, Some, Full) and by Type? (French, Italian, Thai, Burger); the Patrons? subsets are nearly pure, the Type? subsets are not.]

Patrons? is a better choice: it gives information about the classification.

Information answers questions

The more clueless I am about the answer initially, the more information is contained in the answer.
Scale: 1 bit = the answer to a Boolean question with prior $\langle 0.5, 0.5 \rangle$.
The information in an answer when the prior is $\langle P_1, \ldots, P_n \rangle$ is
$$H(\langle P_1, \ldots, P_n \rangle) = -\sum_{i=1}^{n} P_i \log_2 P_i$$
(also called the entropy of the prior).

Information contd.

Suppose we have $p$ positive and $n$ negative examples at the root; then $H(\langle p/(p+n), n/(p+n) \rangle)$ bits are needed to classify a new example. E.g., for the 12 restaurant examples, $p = n = 6$, so we need 1 bit.

An attribute splits the examples $E$ into subsets $E_i$, each of which (we hope) needs less information to complete the classification. Let $E_i$ have $p_i$ positive and $n_i$ negative examples; then $H(\langle p_i/(p_i+n_i), n_i/(p_i+n_i) \rangle)$ bits are needed to classify a new example, and the expected number of bits per example over all branches is
$$\sum_i \frac{p_i + n_i}{p + n}\, H\!\left(\left\langle \frac{p_i}{p_i+n_i}, \frac{n_i}{p_i+n_i} \right\rangle\right)$$
For Patrons? this is 0.459 bits, for Type? it is (still) 1 bit, so choose the attribute that minimizes the remaining information needed.
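
The quoted numbers follow directly from the counts in the 12-example table; a short check (not from the slides):

import math

# "Remaining information" for the 12 restaurant examples; each branch is a
# (positive, negative) count pair read off the table.

def H(*probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def remainder(branches):
    total = sum(p + n for p, n in branches)
    return sum((p + n) / total * H(p / (p + n), n / (p + n))
               for p, n in branches)

patrons = [(0, 2), (4, 0), (2, 4)]            # None, Some, Full
type_   = [(1, 1), (1, 1), (2, 2), (2, 2)]    # French, Italian, Thai, Burger
print(remainder(patrons), remainder(type_))   # ~0.459 and 1.0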

Example contd.

Decision tree learned from the 12 examples:

Patrons?
  None: F
  Some: T
  Full: Hungry?
          No:  F
          Yes: Type?
                 French:  T
                 Italian: F
                 Thai:    Fri/Sat?  (No: F, Yes: T)
                 Burger:  T

Substantially simpler than the true tree: a more complex hypothesis isn't justified by the small amount of data.

Decision tree as local conditional model

Performance measurement

How do we know that $h \approx f$? (Hume's Problem of Induction)
1) Use theorems of computational/statistical learning theory.
2) Try $h$ on a new test set of examples (using the same distribution over the example space as for the training set).
Learning curve = % correct on the test set as a function of training set size.

[Figure: a learning curve: % correct on the test set, rising from about 0.4 towards 1 as the training set size grows from 0 to 100.]
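
A generic way to produce such a curve: train on increasingly large prefixes of the training data and score each model on a fixed held-out test set. The sketch below assumes scikit-learn is available and uses a synthetic dataset in place of the restaurant domain; the names and numbers are illustrative, not from the slides.

# Learning curve sketch: % correct on a fixed test set vs. training set size.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=100,
                                                    random_state=0)
for m in range(10, 501, 50):                        # increasing training set size
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[:m], y_train[:m])
    print(m, tree.score(X_test, y_test))            # accuracy on held-out test set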

Performance measurement contd.

The learning curve depends on whether the problem is realizable (the hypothesis space can express the target function) or non-realizable. Non-realizability can be due to missing attributes or a restricted hypothesis class (e.g., a thresholded linear function). Redundant expressiveness (e.g., loads of irrelevant attributes) also affects the curve.

[Figure: % correct against number of examples for three cases: realizable, redundant, and non-realizable; the realizable curve rises fastest towards 1, the redundant curve rises more slowly, and the non-realizable curve levels off below 1.]

Summary

- Learning is needed for unknown environments and for lazy designers.
- The learning method depends on the type of performance element, the available feedback, the type of component to be improved, and its representation.
- Full Bayesian learning gives the best possible predictions but is intractable.
- MAP learning balances complexity with accuracy on the training data.
- Maximum likelihood assumes a uniform prior; OK for large data sets.
- For supervised learning, the aim is to find a simple hypothesis that is approximately consistent with the training examples.
- Decision tree learning uses information gain.
- Learning performance = prediction accuracy measured on a test set.