CS 380: ARTIFICIAL INTELLIGENCE


CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING 11/11/2013 Santiago Ontañón santi@cs.drexel.edu https://www.cs.drexel.edu/~santi/teaching/2013/cs380/intro.html

Summary so far:
  Rational Agents
  Problem Solving: Systematic Search (Uninformed, Informed), Local Search, Adversarial Search
  Logic and Knowledge Representation: Predicate Logic, First-order Logic
Today: Machine Learning

What is Learning?

Machine Learning: computational methods for computers to exhibit specific forms of learning. For example:
  Learning from examples: supervised learning, unsupervised learning
  Reinforcement learning
  Learning from observation (demonstration/imitation)

Examples Supervised Learning: learning to recognize writing

Examples Unsupervised Learning: clustering observations into meaningful classes

Examples Reinforcement Learning: learning to walk

Examples Learning from demonstration: performing tasks that other agents (or humans) can do

Learning
  Learning is essential for unknown environments, i.e., when the designer lacks omniscience.
  Learning is useful as a system construction method, i.e., expose the agent to reality rather than trying to write it down.
  Learning modifies the agent's decision mechanisms to improve performance.

Learning agents
  [Diagram of a learning agent: a critic compares sensor feedback against a performance standard and passes feedback to the learning element; the learning element makes changes to the performance element and sets learning goals for a problem generator, which proposes experiments; the performance element drives the effectors acting on the environment.]

Learning element
  Design of the learning element is dictated by:
    what type of performance element is used
    which functional component is to be learned
    how that functional component is represented
    what kind of feedback is available
  Example scenarios:
    Performance element | Component          | Representation           | Feedback
    Alpha-beta search   | Eval. fn.          | Weighted linear function | Win/loss
    Logical agent       | Transition model   | Successor-state axioms   | Outcome
    Utility-based agent | Transition model   | Dynamic Bayes net        | Outcome
    Simple reflex agent | Percept-action fn. | Neural net               | Correct action
  Supervised learning: correct answers for each instance
  Reinforcement learning: occasional rewards

Inductive learning (a.k.a. Science)
  Simplest form: learn a function from examples (tabula rasa)
  f is the target function; an example is a pair (x, f(x)), e.g., a tic-tac-toe position x labelled +1
  Problem: find a hypothesis h such that h ≈ f, given a training set of examples
  (This is a highly simplified model of real learning:
    ignores prior knowledge
    assumes a deterministic, observable environment
    assumes examples are given
    assumes that the agent wants to learn f. Why?)

Inductive learning method
  Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples)
  E.g., curve fitting: [sequence of figures fitting curves of increasing complexity to the same data points; several different hypotheses are consistent with the training data]
  Ockham's razor: maximize a combination of consistency and simplicity
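
The curve-fitting picture can be reproduced in a few lines of code. The sketch below is my own illustration (not from the slides) and assumes NumPy is available: it fits polynomials of increasing degree to the same small set of points; the high-degree fit matches every point exactly, but Ockham's razor argues for the simpler hypothesis unless the extra complexity is justified.

```python
import numpy as np

# A handful of (x, f(x)) training points (illustrative values, not from the slides).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1, 4.9])

for degree in (1, 3, 5):
    coeffs = np.polyfit(x, y, degree)                 # least-squares polynomial fit
    residual = np.abs(np.polyval(coeffs, x) - y).max()
    print(f"degree {degree}: max training error = {residual:.3f}")

# The degree-5 polynomial passes through all six points (error ~0),
# but the nearly linear data is better explained by the degree-1 hypothesis.
```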

Attribute-based representations
  Examples described by attribute values (Boolean, discrete, continuous, etc.)
  E.g., situations where I will/won't wait for a table:

    Example  Alt  Bar  Fri  Hun  Pat    Price  Rain  Res  Type     Est    WillWait
    X1       T    F    F    T    Some   $$$    F     T    French   0-10   T
    X2       T    F    F    T    Full   $      F     F    Thai     30-60  F
    X3       F    T    F    F    Some   $      F     F    Burger   0-10   T
    X4       T    F    T    T    Full   $      F     F    Thai     10-30  T
    X5       T    F    T    F    Full   $$$    F     T    French   >60    F
    X6       F    T    F    T    Some   $$     T     T    Italian  0-10   T
    X7       F    T    F    F    None   $      T     F    Burger   0-10   F
    X8       F    F    F    T    Some   $$     T     T    Thai     0-10   T
    X9       F    T    T    F    Full   $      T     F    Burger   >60    F
    X10      T    T    T    T    Full   $$$    F     T    Italian  10-30  F
    X11      F    F    F    F    None   $      F     F    Thai     0-10   F
    X12      T    T    T    T    Full   $      F     F    Burger   30-60  T

  Classification of examples is positive (T) or negative (F)
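
For later examples it helps to have the data in machine-readable form. Below is a minimal sketch (my own encoding, not from the slides) of how the first three restaurant examples might be represented as attribute/value dictionaries in Python; the remaining nine rows follow the same pattern.

```python
# Each example maps attribute names to values, plus the target label WillWait.
# Only X1-X3 from the table are shown; the other nine rows follow the same pattern.
examples = [
    {"Alt": True,  "Bar": False, "Fri": False, "Hun": True,  "Pat": "Some",
     "Price": "$$$", "Rain": False, "Res": True,  "Type": "French", "Est": "0-10",
     "WillWait": True},   # X1
    {"Alt": True,  "Bar": False, "Fri": False, "Hun": True,  "Pat": "Full",
     "Price": "$",   "Rain": False, "Res": False, "Type": "Thai",   "Est": "30-60",
     "WillWait": False},  # X2
    {"Alt": False, "Bar": True,  "Fri": False, "Hun": False, "Pat": "Some",
     "Price": "$",   "Rain": False, "Res": False, "Type": "Burger", "Est": "0-10",
     "WillWait": True},   # X3
]
```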

Decision trees
  One possible representation for hypotheses
  E.g., here is the "true" tree for deciding whether to wait:

    Patrons?
      None  -> F
      Some  -> T
      Full  -> WaitEstimate?
                 >60   -> F
                 30-60 -> Alternate?
                            No  -> Reservation?
                                     No  -> Bar?  (No -> F, Yes -> T)
                                     Yes -> T
                            Yes -> Fri/Sat?  (No -> F, Yes -> T)
                 10-30 -> Hungry?
                            No  -> T
                            Yes -> Alternate?
                                     No  -> T
                                     Yes -> Raining?  (No -> F, Yes -> T)
                 0-10  -> T

Expressiveness
  Decision trees can express any function of the input attributes.
  E.g., for Boolean functions, each truth table row corresponds to a path to a leaf:

    A  B  | A xor B
    F  F  | F
    F  T  | T
    T  F  | T
    T  T  | F

  [figure: the corresponding decision tree, testing A at the root and B on each branch]
  Trivially, there is a consistent decision tree for any training set with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples.
  Prefer to find more compact decision trees.
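
As an illustration (my own sketch, not from the slides), the XOR truth table above can be written directly as a decision tree in code: test A at the root, then B on each branch, with one leaf per truth-table row.

```python
def xor_tree(a: bool, b: bool) -> bool:
    """Decision tree for A xor B: the root tests A, each branch then tests B."""
    if a:
        return not b   # A=T: leaf is T when B=F, F when B=T
    else:
        return b       # A=F: leaf is F when B=F, T when B=T

# One path to a leaf per truth-table row.
for a in (False, True):
    for b in (False, True):
        print(a, b, "->", xor_tree(a, b))
```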

Hypothesis spaces
  How many distinct decision trees with n Boolean attributes?
    = number of Boolean functions
    = number of distinct truth tables with 2^n rows
    = 2^(2^n)
  E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees.
  How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)?
    Each attribute can be in (positive), in (negative), or out,
    so there are 3^n distinct conjunctive hypotheses.
  A more expressive hypothesis space:
    increases the chance that the target function can be expressed
    increases the number of hypotheses consistent with the training set
    => may get worse predictions
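
The counts above are easy to check numerically. A minimal sketch (illustrative, standard library only):

```python
n = 6  # number of Boolean attributes

num_boolean_functions = 2 ** (2 ** n)   # distinct truth tables with 2^n rows
num_conjunctions = 3 ** n                # each attribute: positive, negative, or absent

print(num_boolean_functions)  # 18446744073709551616
print(num_conjunctions)       # 729
```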

Decision tree learning
  Aim: find a small tree consistent with the training examples
  Idea: (recursively) choose the "most significant" attribute as root of the (sub)tree

  function DTL(examples, attributes, default) returns a decision tree
    if examples is empty then return default
    else if all examples have the same classification then return the classification
    else if attributes is empty then return Mode(examples)
    else
      best <- Choose-Attribute(attributes, examples)
      tree <- a new decision tree with root test best
      for each value v_i of best do
        examples_i <- {elements of examples with best = v_i}
        subtree <- DTL(examples_i, attributes - best, Mode(examples))
        add a branch to tree with label v_i and subtree subtree
      return tree
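
A minimal runnable sketch of the pseudocode above in Python (my own translation, not from the slides). Examples are dictionaries as in the earlier encoding; choose_attribute is left pluggable and here simply picks the first remaining attribute for brevity, with the information-gain criterion of the following slides as the intended replacement.

```python
from collections import Counter

def mode(examples, target):
    """Most common classification among the examples."""
    return Counter(e[target] for e in examples).most_common(1)[0][0]

def dtl(examples, attributes, default, target="WillWait", choose_attribute=None):
    """Returns a leaf label, or a node of the form (attribute, {value: subtree})."""
    if not examples:
        return default
    classifications = {e[target] for e in examples}
    if len(classifications) == 1:
        return classifications.pop()
    if not attributes:
        return mode(examples, target)
    if choose_attribute is None:
        choose_attribute = lambda attrs, exs: attrs[0]  # placeholder; use info gain instead
    best = choose_attribute(attributes, examples)
    tree = (best, {})
    for v in {e[best] for e in examples}:   # values of best seen in the examples
        exs_i = [e for e in examples if e[best] == v]
        subtree = dtl(exs_i, [a for a in attributes if a != best],
                      mode(examples, target), target, choose_attribute)
        tree[1][v] = subtree
    return tree
```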

Choosing an attribute
  Idea: a good attribute splits the examples into subsets that are (ideally) all positive or all negative.
  [figure: splitting the 12 examples on Patrons? (None/Some/Full) vs. on Type? (French/Italian/Thai/Burger)]
  Patrons? is a better choice: it gives information about the classification.

Information
  Information answers questions.
  The more clueless I am about the answer initially, the more information is contained in the answer.
  Scale: 1 bit = answer to a Boolean question with prior ⟨0.5, 0.5⟩
  Information in an answer when the prior is ⟨P1, ..., Pn⟩ is
    H(⟨P1, ..., Pn⟩) = - Σ_{i=1..n} Pi log2 Pi
  (also called the entropy of the prior)
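
A minimal sketch of the entropy formula in Python (illustrative, standard library only):

```python
from math import log2

def entropy(*probabilities):
    """H(<P1,...,Pn>) = -sum_i Pi * log2(Pi), taking 0 * log2(0) as 0."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy(0.5, 0.5))  # 1.0 bit: answer to a fair Boolean question
print(entropy(1.0, 0.0))  # 0.0 bits: the answer is already known
```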

Information contd.
  Suppose we have p positive and n negative examples at the root:
    H(⟨p/(p+n), n/(p+n)⟩) bits are needed to classify a new example.
    E.g., for the 12 restaurant examples, p = n = 6, so we need 1 bit.
  An attribute splits the examples E into subsets E_i, each of which (we hope) needs less information to complete the classification.
  Let E_i have p_i positive and n_i negative examples:
    H(⟨p_i/(p_i+n_i), n_i/(p_i+n_i)⟩) bits are needed to classify a new example, so the
    expected number of bits per example over all branches is
      Σ_i ((p_i + n_i)/(p + n)) H(⟨p_i/(p_i+n_i), n_i/(p_i+n_i)⟩)
  For Patrons?, this is 0.459 bits; for Type?, this is (still) 1 bit.
  => choose the attribute that minimizes the remaining information needed.
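
The 0.459-bit figure can be reproduced directly from the table. A minimal sketch (my own code, standard library only), using the per-branch positive/negative counts for Patrons? and Type? read off the 12 restaurant examples:

```python
from math import log2

def entropy2(p, n):
    """Entropy of a branch with p positive and n negative examples."""
    total = p + n
    return -sum(x / total * log2(x / total) for x in (p, n) if x > 0)

def remainder(branches):
    """Expected bits still needed after splitting; branches = [(p_i, n_i), ...]."""
    total = sum(p + n for p, n in branches)
    return sum((p + n) / total * entropy2(p, n) for p, n in branches)

# Counts from the restaurant table: (positive, negative) per attribute value.
patrons = [(0, 2), (4, 0), (2, 4)]          # None, Some, Full
type_   = [(1, 1), (1, 1), (2, 2), (2, 2)]  # French, Italian, Thai, Burger

print(round(remainder(patrons), 3))  # 0.459
print(round(remainder(type_), 3))    # 1.0
```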

Example contd.
  Decision tree learned from the 12 examples:

    Patrons?
      None -> F
      Some -> T
      Full -> Hungry?
                No  -> F
                Yes -> Type?
                         French  -> T
                         Italian -> F
                         Thai    -> Fri/Sat?  (No -> F, Yes -> T)
                         Burger  -> T

  Substantially simpler than the "true" tree: a more complex hypothesis isn't justified by a small amount of data.

Performance measurement
  How do we know that h ≈ f? (Hume's Problem of Induction)
    1) Use theorems of computational/statistical learning theory
    2) Try h on a new test set of examples (use the same distribution over the example space as for the training set)
  Learning curve = % correct on the test set as a function of training set size
  [figure: learning curve for the restaurant data; % correct on the test set rises from roughly 0.4 toward 1.0 as the training set grows from 0 to 100 examples]
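
A minimal sketch of how such a learning curve might be measured (my own illustration; assumes scikit-learn is available, with a synthetic dataset standing in for the restaurant examples):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data in place of the restaurant examples (illustrative only).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Train on increasing prefixes of the training set; score on the held-out test set.
for m in (10, 25, 50, 100, 200, len(X_train)):
    clf = DecisionTreeClassifier(random_state=0).fit(X_train[:m], y_train[:m])
    print(f"{m:4d} training examples -> {clf.score(X_test, y_test):.2f} test accuracy")
```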

Performance measurement contd.
  The learning curve depends on:
    realizable (the hypothesis space can express the target function) vs. non-realizable
      non-realizability can be due to missing attributes or a restricted hypothesis class (e.g., thresholded linear functions)
    redundant expressiveness (e.g., loads of irrelevant attributes)
  [figure: % correct vs. number of examples for realizable, redundant, and non-realizable hypothesis spaces]

Summary
  Learning is needed for unknown environments (and for lazy designers).
  Learning agent = performance element + learning element.
  The learning method depends on the type of performance element, the available feedback, the type of component to be improved, and its representation.
  For supervised learning, the aim is to find a simple hypothesis that is approximately consistent with the training examples.
  Decision tree learning uses information gain.
  Learning performance = prediction accuracy measured on a test set.