Statistical Learning. Philipp Koehn. 10 November 2015


Outline 1
Learning agents
Inductive learning
Decision tree learning
Measuring learning performance
Bayesian learning
Maximum a posteriori and maximum likelihood learning
Bayes net learning: ML parameter learning with complete data, linear regression

2 learning agents

Learning 3
Learning is essential for unknown environments, i.e., when the designer lacks omniscience
Learning is useful as a system construction method, i.e., expose the agent to reality rather than trying to write it down
Learning modifies the agent's decision mechanisms to improve performance

Learning Agents 4

Learning Element 5
Design of the learning element is dictated by
what type of performance element is used
which functional component is to be learned
how that functional component is represented
what kind of feedback is available
Example scenarios:

Feedback 6
Supervised learning: the correct answer is given for each instance; try to learn the mapping x → f(x)
Reinforcement learning: occasional rewards, possibly delayed; still needs to learn the utility of intermediate actions
Unsupervised learning: density estimation; learns the distribution of data points, maybe clusters

What are we Learning? 7
Assignment to a class (maybe just a binary yes/no decision) → Classification
A real-valued number → Regression

Inductive Learning 8
Simplest form: learn a function from examples (tabula rasa)
f is the target function
An example is a pair (x, f(x)), e.g., a game position x labelled with its value +1
Problem: find a hypothesis h such that h ≈ f, given a training set of examples
This is a highly simplified model of real learning:
Ignores prior knowledge
Assumes a deterministic, observable environment
Assumes examples are given
Assumes that the agent wants to learn f

Inductive Learning Method 9 Construct/adjust h to agree with f on training set (h is consistent if it agrees with f on all examples) E.g., curve fitting:


Inductive Learning Method 13 Construct/adjust h to agree with f on training set (h is consistent if it agrees with f on all examples) E.g., curve fitting: Ockham's razor: maximize a combination of consistency and simplicity
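To make the consistency-vs-simplicity trade-off concrete, here is a small numpy sketch of my own (the five data points are invented, not from the slides): a degree-4 polynomial is consistent with all five training points, while a straight line is simpler but only approximately consistent.

```python
import numpy as np

# Five training points (x, f(x)); hypothetical data for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.3, 2.8, 4.2])

# A simple hypothesis: straight line (degree 1).
line = np.polyfit(x, y, deg=1)
# A complex hypothesis: degree-4 polynomial, consistent with all 5 points.
poly = np.polyfit(x, y, deg=4)

for name, coeffs in [("line", line), ("degree-4", poly)]:
    pred = np.polyval(coeffs, x)
    print(name, "training error:", np.sum((pred - y) ** 2))

# The degree-4 fit drives the training error to (almost) zero, but Ockham's
# razor prefers the line unless the extra complexity is justified by the data.
```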

14 decision trees

Attribute-Based Representations 15
Examples described by attribute values (Boolean, discrete, continuous, etc.)
E.g., situations where I will/won't wait for a table:

Example  Alt  Bar  Fri  Hun  Pat    Price  Rain  Res  Type     Est    WillWait
X1       T    F    F    T    Some   $$$    F     T    French   0-10   T
X2       T    F    F    T    Full   $      F     F    Thai     30-60  F
X3       F    T    F    F    Some   $      F     F    Burger   0-10   T
X4       T    F    T    T    Full   $      F     F    Thai     10-30  T
X5       T    F    T    F    Full   $$$    F     T    French   >60    F
X6       F    T    F    T    Some   $$     T     T    Italian  0-10   T
X7       F    T    F    F    None   $      T     F    Burger   0-10   F
X8       F    F    F    T    Some   $$     T     T    Thai     0-10   T
X9       F    T    T    F    Full   $      T     F    Burger   >60    F
X10      T    T    T    T    Full   $$$    F     T    Italian  10-30  F
X11      F    F    F    F    None   $      F     F    Thai     0-10   F
X12      T    T    T    T    Full   $      F     F    Burger   30-60  T

Classification of examples is positive (T) or negative (F)

Decision Trees 16 One possible representation for hypotheses E.g., here is the true tree for deciding whether to wait:

Expressiveness 17
Decision trees can express any function of the input attributes
E.g., for Boolean functions: truth table row → path to leaf
Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples
Prefer to find more compact decision trees

Hypothesis Spaces 18
How many distinct decision trees with n Boolean attributes?
= number of Boolean functions
= number of distinct truth tables with 2^n rows
= 2^(2^n)
E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees
How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)?
Each attribute can be in (positive), in (negative), or out
⇒ 3^n distinct conjunctive hypotheses
A more expressive hypothesis space
increases the chance that the target function can be expressed
increases the number of hypotheses consistent with the training set
⇒ may get worse predictions
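These counts are easy to verify; a quick check of my own in Python:

```python
# Sanity check of the counts on this slide.
n = 6
num_trees = 2 ** (2 ** n)       # one Boolean function per distinct truth table
num_conjunctions = 3 ** n       # each attribute: positive literal, negative literal, or absent
print(num_trees)                # 18446744073709551616
print(num_conjunctions)         # 729
```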

Decision Tree Learning 19
Aim: find a small tree consistent with the training examples
Idea: (recursively) choose the most significant attribute as root of the (sub)tree

function DTL(examples, attributes, default) returns a decision tree
    if examples is empty then return default
    else if all examples have the same classification then return the classification
    else if attributes is empty then return MODE(examples)
    else
        best ← CHOOSE-ATTRIBUTE(attributes, examples)
        tree ← a new decision tree with root test best
        for each value v_i of best do
            examples_i ← {elements of examples with best = v_i}
            subtree ← DTL(examples_i, attributes − best, MODE(examples))
            add a branch to tree with label v_i and subtree subtree
        return tree
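Below is a minimal Python sketch of the DTL procedure above. The representation of examples as (attribute-dict, label) pairs and the helper names are my own choices, not from the lecture; CHOOSE-ATTRIBUTE is filled in with the information-gain criterion developed on the next slides.

```python
import math
from collections import Counter

def mode(examples):
    """Most common classification among the examples."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def choose_attribute(attributes, examples):
    """Pick the attribute with the highest information gain (lowest remainder)."""
    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())
    def remainder(attr):
        groups = {}
        for x, label in examples:
            groups.setdefault(x[attr], []).append(label)
        n = len(examples)
        return sum(len(g) / n * entropy(g) for g in groups.values())
    return min(attributes, key=remainder)

def dtl(examples, attributes, default):
    """Examples are (attribute-dict, label) pairs; returns a nested-dict tree or a label."""
    if not examples:
        return default
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return mode(examples)
    best = choose_attribute(attributes, examples)
    tree = {best: {}}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, label) for x, label in examples if x[best] == value]
        tree[best][value] = dtl(subset, [a for a in attributes if a != best], mode(examples))
    return tree
```

Run on the 12 restaurant examples above, this should put Patrons? at the root, matching the learned tree shown on a later slide.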

Choosing an Attribute 20
Idea: a good attribute splits the examples into subsets that are (ideally) all positive or all negative
Patrons? is a better choice: it gives information about the classification

Information 21
Information answers questions
The more clueless I am about the answer initially, the more information is contained in the answer
Scale: 1 bit = answer to a Boolean question with prior ⟨0.5, 0.5⟩
Information in an answer when the prior is ⟨P_1, ..., P_n⟩ is (also called the entropy of the prior):
H(⟨P_1, ..., P_n⟩) = − Σ_{i=1}^{n} P_i log_2 P_i
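As a quick illustration (mine, not from the slides), the entropy formula in Python, assuming the prior is given as a list of probabilities:

```python
import math

def entropy(probs):
    """H(<P1,...,Pn>) = -sum P_i * log2 P_i, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit: a fair Boolean question
print(entropy([0.99, 0.01]))  # about 0.08 bits: the answer was almost known already
```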

Information 22
Suppose we have p positive and n negative examples at the root
⇒ H(⟨p/(p+n), n/(p+n)⟩) bits needed to classify a new example
E.g., for the 12 restaurant examples, p = n = 6, so we need 1 bit
An attribute splits the examples E into subsets E_i, each of which needs less information to complete the classification
Let E_i have p_i positive and n_i negative examples
⇒ H(⟨p_i/(p_i+n_i), n_i/(p_i+n_i)⟩) bits needed to classify a new example
⇒ expected number of bits per example over all branches is
Σ_i ((p_i + n_i)/(p + n)) · H(⟨p_i/(p_i+n_i), n_i/(p_i+n_i)⟩)

Select Attribute 23
For the Patrons? split the branch entropies are 0 bits, 0 bits, and 0.918 bits; for the Type split every branch has entropy 1 bit
Patrons?: 0.459 bits, Type: 1 bit
Choose the attribute that minimizes the remaining information needed
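These numbers can be checked directly from the positive/negative counts per branch in the table of 12 examples; a short sketch of my own:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def remainder(branches):
    """branches: list of (positives, negatives) counts, one pair per attribute value."""
    total = sum(p + n for p, n in branches)
    return sum((p + n) / total * entropy([p / (p + n), n / (p + n)])
               for p, n in branches)

# Patrons?: None -> (0,2), Some -> (4,0), Full -> (2,4)
print(round(remainder([(0, 2), (4, 0), (2, 4)]), 3))          # 0.459
# Type: French -> (1,1), Italian -> (1,1), Thai -> (2,2), Burger -> (2,2)
print(round(remainder([(1, 1), (1, 1), (2, 2), (2, 2)]), 3))  # 1.0
```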

Example 24
Decision tree learned from the 12 examples:
Substantially simpler than the true tree (a more complex hypothesis isn't justified by a small amount of data)

25 performance measurements

Performance Measurement 26
How do we know that h ≈ f? (Hume's Problem of Induction)
Use theorems of computational/statistical learning theory
Try h on a new test set of examples (use the same distribution over the example space as for the training set)
Learning curve = % correct on test set as a function of training set size
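For concreteness, here is a rough sketch of how such a learning curve is measured in practice: train on progressively larger subsets and record test-set accuracy. It uses scikit-learn's DecisionTreeClassifier and a stand-in dataset, neither of which appears in the lecture.

```python
from sklearn.datasets import load_breast_cancer          # stand-in dataset for illustration
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for m in (20, 50, 100, 200, len(X_train)):
    clf = DecisionTreeClassifier(random_state=0).fit(X_train[:m], y_train[:m])
    print(m, round(clf.score(X_test, y_test), 3))         # accuracy typically rises with m
```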

Performance Measurement 27
Learning curve depends on
realizable (can express target function) vs. non-realizable; non-realizability can be due to missing attributes or a restricted hypothesis class (e.g., thresholded linear function)
redundant expressiveness (e.g., loads of irrelevant attributes)

Overfitting 28

29 bayesian learning

Full Bayesian Learning 30
View learning as Bayesian updating of a probability distribution over the hypothesis space
H is the hypothesis variable, with values h_1, h_2, ... and prior P(H)
The jth observation d_j gives the outcome of the random variable D_j; training data d = d_1, ..., d_N
Given the data so far, each hypothesis has a posterior probability
P(h_i | d) = α P(d | h_i) P(h_i)
where P(d | h_i) is called the likelihood
Predicting the next data point uses a likelihood-weighted average over the hypotheses:
P(X | d) = Σ_i P(X | d, h_i) P(h_i | d) = Σ_i P(X | h_i) P(h_i | d)

Example 31
Suppose there are five kinds of bags of candies:
10% are h_1: 100% cherry candies
20% are h_2: 75% cherry candies + 25% lime candies
40% are h_3: 50% cherry candies + 50% lime candies
20% are h_4: 25% cherry candies + 75% lime candies
10% are h_5: 100% lime candies
Then we observe candies drawn from some bag:
What kind of bag is it? What flavour will the next candy be?

Posterior Probability of Hypotheses 32

Prediction Probability 33
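The two plots above can be reproduced with a few lines of Python. This is my own sketch, and it assumes for illustration, as in the textbook example, that every candy observed so far has been lime:

```python
# Five candy-bag hypotheses: prior probability and P(lime | hypothesis).
priors = [0.1, 0.2, 0.4, 0.2, 0.1]           # h1 ... h5
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]

def update(observations):
    """Posterior P(h_i | d) after a sequence of 'lime'/'cherry' observations."""
    post = list(priors)
    for obs in observations:
        post = [p * (pl if obs == "lime" else 1 - pl) for p, pl in zip(post, p_lime)]
        z = sum(post)
        post = [p / z for p in post]          # normalization (the alpha on slide 30)
    return post

for n in range(11):                           # 0 to 10 lime candies in a row
    post = update(["lime"] * n)
    pred_lime = sum(p * pl for p, pl in zip(post, p_lime))   # P(next = lime | d)
    print(n, [round(p, 3) for p in post], round(pred_lime, 3))
```

Taking the argmax of the posterior instead of the full likelihood-weighted average gives the MAP approximation on the next slide.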

Maximum A-Posteriori Approximation 34
Summing over the hypothesis space is often intractable (e.g., 18,446,744,073,709,551,616 Boolean functions of 6 attributes)
Maximum a posteriori (MAP) learning: choose h_MAP maximizing P(h_i | d)
I.e., maximize P(d | h_i) P(h_i), or log P(d | h_i) + log P(h_i)
Log terms can be viewed as (negative of) bits to encode data given hypothesis + bits to encode hypothesis
This is the basic idea of minimum description length (MDL) learning
For deterministic hypotheses, P(d | h_i) is 1 if consistent, 0 otherwise
⇒ MAP = simplest consistent hypothesis

Maximum Likelihood Approximation 35
For large data sets, the prior becomes irrelevant
Maximum likelihood (ML) learning: choose h_ML maximizing P(d | h_i)
Simply get the best fit to the data; identical to MAP for a uniform prior (which is reasonable if all hypotheses are of the same complexity)
ML is the standard (non-Bayesian) statistical learning method

ML Parameter Learning in Bayes Nets 36
Bag from a new manufacturer; what fraction θ of cherry candies?
Any θ is possible: a continuum of hypotheses h_θ
θ is a parameter for this simple (binomial) family of models
Suppose we unwrap N candies, c cherries and l = N − c limes
These are i.i.d. (independent, identically distributed) observations, so
P(d | h_θ) = Π_{j=1}^{N} P(d_j | h_θ) = θ^c (1 − θ)^l
Maximize this w.r.t. θ, which is easier for the log-likelihood:
L(d | h_θ) = log P(d | h_θ) = Σ_{j=1}^{N} log P(d_j | h_θ) = c log θ + l log(1 − θ)
dL(d | h_θ)/dθ = c/θ − l/(1 − θ) = 0  ⇒  θ = c/(c + l) = c/N
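A tiny numeric check of my own (with made-up counts c = 3, l = 7): the log-likelihood c log θ + l log(1 − θ) really does peak at θ = c/N.

```python
import math

c, l = 3, 7                        # hypothetical counts: 3 cherries, 7 limes
N = c + l

def log_likelihood(theta):
    return c * math.log(theta) + l * math.log(1 - theta)

# Brute-force search over a fine grid of theta values.
best = max((i / 1000 for i in range(1, 1000)), key=log_likelihood)
print(best, c / N)                 # both print 0.3
```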

Multiple Parameters 37
Red/green wrapper depends probabilistically on flavor
Likelihood for, e.g., a cherry candy in a green wrapper:
P(F = cherry, W = green | h_{θ,θ1,θ2})
= P(F = cherry | h_{θ,θ1,θ2}) P(W = green | F = cherry, h_{θ,θ1,θ2})
= θ (1 − θ_1)
N candies, r_c red-wrapped cherry candies, etc.:
P(d | h_{θ,θ1,θ2}) = θ^c (1 − θ)^l · θ_1^{r_c} (1 − θ_1)^{g_c} · θ_2^{r_l} (1 − θ_2)^{g_l}
L = [c log θ + l log(1 − θ)] + [r_c log θ_1 + g_c log(1 − θ_1)] + [r_l log θ_2 + g_l log(1 − θ_2)]

Multiple Parameters 38
Derivatives of L contain only the relevant parameter:
∂L/∂θ = c/θ − l/(1 − θ) = 0  ⇒  θ = c/(c + l)
∂L/∂θ_1 = r_c/θ_1 − g_c/(1 − θ_1) = 0  ⇒  θ_1 = r_c/(r_c + g_c)
∂L/∂θ_2 = r_l/θ_2 − g_l/(1 − θ_2) = 0  ⇒  θ_2 = r_l/(r_l + g_l)
With complete data, parameters can be learned separately
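With complete data the ML estimates are just ratios of counts; a small sketch with invented counts:

```python
# Hypothetical complete-data counts for the flavor/wrapper network.
c, l = 60, 40                   # 60 cherry, 40 lime candies (N = 100)
r_c, g_c = 45, 15               # wrappers of the cherry candies: 45 red, 15 green
r_l, g_l = 10, 30               # wrappers of the lime candies: 10 red, 30 green

theta   = c / (c + l)           # P(F = cherry)
theta_1 = r_c / (r_c + g_c)     # P(W = red | F = cherry)
theta_2 = r_l / (r_l + g_l)     # P(W = red | F = lime)
print(theta, theta_1, theta_2)  # 0.6 0.75 0.25
```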

Regression: Gaussian Models 39
Maximizing
P(y | x) = (1 / (σ √(2π))) · exp(−(y − (θ_1 x + θ_2))² / (2σ²))
w.r.t. θ_1, θ_2 is the same as minimizing
E = Σ_{j=1}^{N} (y_j − (θ_1 x_j + θ_2))²
That is, minimizing the sum of squared errors gives the ML solution for a linear fit, assuming Gaussian noise of fixed variance
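A minimal numpy sketch of my own of the resulting least-squares fit; the data are synthetic, generated as y ≈ 2x + 1 plus Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)   # Gaussian noise, fixed variance

# Least squares via the normal equations: minimizes the sum of squared errors,
# which is the ML solution under the Gaussian-noise assumption on this slide.
A = np.column_stack([x, np.ones_like(x)])
(theta1, theta2), *_ = np.linalg.lstsq(A, y, rcond=None)
print(round(theta1, 2), round(theta2, 2))                 # close to 2.0 and 1.0
```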

Many Attributes 40
Recall the "wait for a table?" example: the decision depends on has-bar, hungry?, price, weather, type of restaurant, wait time, ...
A data point d = (d_1, d_2, d_3, ..., d_n)^T is a high-dimensional vector
⇒ P(d | h) is very sparse
Naive Bayes:
P(d | h) = P(d_1, d_2, d_3, ..., d_n | h) = Π_i P(d_i | h)
(independence assumption between all attributes)
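A compact sketch of a naive Bayes classifier over discrete attributes. The four training cases, the attribute names, and the add-alpha smoothing are my own made-up miniature of the restaurant data, purely to show the counting:

```python
from collections import Counter, defaultdict

# (attribute dict, class) pairs; a hypothetical subset of the restaurant data.
data = [
    ({"Pat": "Some", "Hun": "T", "Type": "French"}, "T"),
    ({"Pat": "Full", "Hun": "T", "Type": "Thai"},   "F"),
    ({"Pat": "Some", "Hun": "F", "Type": "Burger"}, "T"),
    ({"Pat": "Full", "Hun": "F", "Type": "French"}, "F"),
]

class_counts = Counter(h for _, h in data)
attr_counts = defaultdict(Counter)            # (class, attribute) -> Counter of values
for x, h in data:
    for a, v in x.items():
        attr_counts[(h, a)][v] += 1

def posterior(x, h, alpha=1.0):
    """Unnormalized P(h) * prod_i P(d_i | h), with crude add-alpha smoothing."""
    p = class_counts[h] / len(data)
    for a, v in x.items():
        counts = attr_counts[(h, a)]
        p *= (counts[v] + alpha) / (sum(counts.values()) + alpha * (len(counts) + 1))
    return p

query = {"Pat": "Some", "Hun": "T", "Type": "Thai"}
print(max(class_counts, key=lambda h: posterior(query, h)))   # predicted class
```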

How To 41
1. Choose a parameterized family of models to describe the data; requires substantial insight and sometimes new models
2. Write down the likelihood of the data as a function of the parameters; may require summing over hidden variables, i.e., inference
3. Write down the derivative of the log-likelihood w.r.t. each parameter
4. Find the parameter values such that the derivatives are zero; may be hard/impossible; modern optimization techniques help
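When step 4 has no closed-form solution, the usual fallback is numerical optimization of the negative log-likelihood. A hedged sketch of my own using scipy on the single-parameter candy model from earlier (the counts are invented):

```python
import math
from scipy.optimize import minimize_scalar

c, l = 3, 7    # hypothetical counts: 3 cherry, 7 lime candies

def neg_log_likelihood(theta):
    return -(c * math.log(theta) + l * math.log(1 - theta))

# Bounded scalar minimization stands in for "modern optimization techniques".
result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(round(result.x, 4))   # about 0.3, i.e. c/N, matching the closed-form solution
```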

Summary 42
Learning needed for unknown environments
Learning agent = performance element + learning element
Learning method depends on type of performance element, available feedback, type of component to be improved, and its representation
Supervised learning
Decision tree learning using information gain
Learning performance = prediction accuracy measured on test set
Bayesian learning