Artificial Intelligence, Lecture 7.2 — (c) D. Poole and A. Mackworth 2010


Learning Objectives

At the end of the class you should be able to:
- identify a supervised learning problem
- characterize how the prediction is a function of the error measure
- avoid mixing the training and test sets

Supervised Learning

Given:
- a set of input features X_1, ..., X_n
- a set of target features Y_1, ..., Y_k
- a set of training examples, where the values of the input features and the target features are given for each example
- a new example, where only the values of the input features are given

predict the values of the target features for the new example.

- classification: when the Y_i are discrete
- regression: when the Y_i are continuous

Example Data Representations

A travel agent wants to predict the preferred length of a trip, which can be from 1 to 6 days. (No input features.) Two representations of the same data: Y is the length of trip chosen; each Y_i is an indicator variable that has value 1 if the chosen length is i, and 0 otherwise.

Example   Y
e1        1
e2        6
e3        6
e4        2
e5        1

Example   Y1  Y2  Y3  Y4  Y5  Y6
e1        1   0   0   0   0   0
e2        0   0   0   0   0   1
e3        0   0   0   0   0   1
e4        0   1   0   0   0   0
e5        1   0   0   0   0   0

What is a prediction?
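As a small illustrative sketch (not part of the original slides), the second, indicator-variable representation can be derived from the first with NumPy; the array names here are my own:

```python
import numpy as np

# Chosen trip lengths for examples e1..e5 (first representation).
y = np.array([1, 6, 6, 2, 1])

# Indicator (one-hot) representation (second representation):
# column i-1 is Y_i, which is 1 exactly when the chosen length is i.
indicators = (y[:, None] == np.arange(1, 7)).astype(int)
print(indicators)
# [[1 0 0 0 0 0]
#  [0 0 0 0 0 1]
#  [0 0 0 0 0 1]
#  [0 1 0 0 0 0]
#  [1 0 0 0 0 0]]
```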

Evaluating Predictions

Suppose we want to make a prediction of the value of a target feature on example e:
- o_e is the observed value of the target feature on example e
- p_e is the predicted value of the target feature on example e
The error of the prediction is a measure of how close p_e is to o_e. There are many possible error measures. Sometimes p_e can be a real number even though o_e can only take a few values.

Measures of error

E is the set of examples, with a single target feature. For e ∈ E, o_e is the observed value and p_e is the predicted value:

- absolute error: L_1(E) = ∑_{e ∈ E} |o_e − p_e|
- sum-of-squares error: L_2^2(E) = ∑_{e ∈ E} (o_e − p_e)^2
- worst-case error: L_∞(E) = max_{e ∈ E} |o_e − p_e|
- number wrong: L_0(E) = #{e ∈ E : o_e ≠ p_e}
- A cost-based error takes into account the costs of different kinds of errors.
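The following is a minimal sketch (not from the slides) that computes these error measures for paired observed and predicted values; the function and variable names are my own:

```python
import numpy as np

def error_measures(o, p):
    """Compute the error measures above for observed values o and predicted values p."""
    o, p = np.asarray(o, dtype=float), np.asarray(p, dtype=float)
    return {
        "L1 (absolute error)": np.sum(np.abs(o - p)),
        "L2^2 (sum of squares)": np.sum((o - p) ** 2),
        "L_inf (worst case)": np.max(np.abs(o - p)),
        "L0 (number wrong)": int(np.sum(o != p)),
    }

# Example: observed trip lengths vs. a constant prediction of 3 days.
print(error_measures([1, 6, 6, 2, 1], [3, 3, 3, 3, 3]))
```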

Measures of error (cont.)

With a binary target feature, o_e ∈ {0, 1}:

- likelihood of the data: ∏_{e ∈ E} p_e^{o_e} (1 − p_e)^{1 − o_e}
- log likelihood: ∑_{e ∈ E} (o_e log p_e + (1 − o_e) log(1 − p_e)),
  which (with base-2 logarithms) is the negative of the number of bits needed to encode the data given a code based on p_e.
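Here is a short sketch of my own (not from the slides) computing the likelihood and the base-2 log likelihood for binary observations; it assumes the predicted probabilities lie strictly between 0 and 1 so the logarithms are defined:

```python
import numpy as np

def likelihood(o, p):
    """Likelihood of binary observations o under predicted probabilities p."""
    o, p = np.asarray(o, dtype=float), np.asarray(p, dtype=float)
    return np.prod(p**o * (1 - p) ** (1 - o))

def log_likelihood_bits(o, p):
    """Base-2 log likelihood: the negative of the number of bits needed
    to encode the data with a code based on p."""
    o, p = np.asarray(o, dtype=float), np.asarray(p, dtype=float)
    return np.sum(o * np.log2(p) + (1 - o) * np.log2(1 - p))

o = [1, 0, 1, 1]
p = [0.8, 0.3, 0.9, 0.6]
print(likelihood(o, p), log_likelihood_bits(o, p))
```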

Information theory overview

A bit is a binary digit.
- 1 bit can distinguish 2 items
- k bits can distinguish 2^k items
- n items can be distinguished using log_2 n bits
Can we do better?

Information and Probability

Consider a code to distinguish elements of {a, b, c, d} with P(a) = 1/2, P(b) = 1/4, P(c) = 1/8, P(d) = 1/8.

Consider the code: a ↦ 0, b ↦ 10, c ↦ 110, d ↦ 111.

This code uses 1 to 3 bits. On average, it uses
P(a)·1 + P(b)·2 + P(c)·3 + P(d)·3 = 1/2 + 2/4 + 3/8 + 3/8 = 1 3/4 bits.

The string aacabbda has code 00110010101110.
The code 0111110010100 represents the string adcabba.
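To make the example concrete, here is a small sketch (not from the slides) that encodes and decodes strings with this prefix code; the helper names are illustrative:

```python
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}

def encode(s):
    return "".join(CODE[ch] for ch in s)

def decode(bits):
    # Because the code is prefix-free, we can decode greedily left to right.
    inverse = {v: k for k, v in CODE.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

print(encode("aacabbda"))       # 00110010101110
print(decode("0111110010100"))  # adcabba
```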

Information Content

To identify x, we need −log_2 P(x) bits. Given a distribution over a set, the expected number of bits needed to identify a member is
−∑_x P(x) log_2 P(x).
This is the information content, or entropy, of the distribution. The expected number of bits it takes to describe the distribution given evidence e is
I(e) = −∑_x P(x | e) log_2 P(x | e).

Information Gain

Given a test that can distinguish the cases where α is true from the cases where α is false, the information gain from this test is:
I(true) − (P(α) · I(α) + P(¬α) · I(¬α))
- I(true) is the expected number of bits needed before the test.
- P(α) · I(α) + P(¬α) · I(¬α) is the expected number of bits after the test.
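A brief sketch of my own (not from the slides) computing entropy and the information gain of a boolean test over a labelled data set; it assumes the labels and test outcomes are given as parallel lists:

```python
import math
from collections import Counter

def entropy(labels):
    """Expected number of bits to identify a label drawn from `labels`."""
    if not labels:
        return 0.0
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, test):
    """Information gain of a boolean test: bits before minus expected bits after."""
    true_part = [y for y, t in zip(labels, test) if t]
    false_part = [y for y, t in zip(labels, test) if not t]
    p = len(true_part) / len(labels)
    after = p * entropy(true_part) + (1 - p) * entropy(false_part)
    return entropy(labels) - after

labels = [1, 1, 0, 0, 1, 0]                     # observed target values
test = [True, True, True, False, False, False]  # outcome of the test alpha
print(information_gain(labels, test))
```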

Linear Predictions

[Figure: data points plotted on axes x = 0..5, y = 0..8, together with linear predictors labelled L_∞, L_2^2, and L_1, i.e. the lines minimizing the worst-case, sum-of-squares, and absolute errors.]

Point Estimates

To make a single prediction for feature Y, given examples E:
- The prediction that minimizes the sum-of-squares error on E is the mean (average) value of Y.
- The prediction that minimizes the absolute error on E is the median value of Y.
- The prediction that minimizes the number wrong on E is the mode of Y.
- The prediction that minimizes the worst-case error on E is (maximum + minimum) / 2.
- When Y has values {0, 1}, the prediction that maximizes the likelihood on E is the empirical probability.
- When Y has values {0, 1}, the prediction that minimizes the entropy on E is the empirical probability.
But that doesn't mean that these predictions minimize the error for future predictions...
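A small numerical check (my own sketch, not from the slides) comparing candidate point estimates against the error measures above; a grid search over candidate predictions stands in for the closed-form argument:

```python
import numpy as np

y = np.array([1, 6, 6, 2, 1])  # trip-length data from the earlier example
candidates = np.linspace(y.min(), y.max(), 501)

def best(candidates, loss):
    """Return the candidate prediction with the smallest loss."""
    return candidates[np.argmin([loss(c) for c in candidates])]

print("mean    ", y.mean(), "~", best(candidates, lambda c: np.sum((y - c) ** 2)))
print("median  ", np.median(y), "~", best(candidates, lambda c: np.sum(np.abs(y - c))))
print("midrange", (y.max() + y.min()) / 2, "~",
      best(candidates, lambda c: np.max(np.abs(y - c))))
```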

Training and Test Sets

To evaluate how well a learner will work on future predictions, we divide the examples into:
- training examples, which are used to train the learner
- test examples, which are used to evaluate the learner
These must be kept separate.

Learning Probabilities

Empirical probabilities do not make good predictors of the test set when evaluated by likelihood or entropy. Why? A probability of zero means "impossible", and it has infinite cost if there is even one true case in the test set.

Solution (Laplace smoothing): add (non-negative) pseudo-counts to the data. Suppose n_i is the number of examples with X = v_i, and c_i is the pseudo-count:
P(X = v_i) = (c_i + n_i) / ∑_i (c_i + n_i)
Pseudo-counts convey prior knowledge. Consider: how much more would I believe v_i if I had seen one example with v_i true than if I had seen no examples with v_i true?
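A final illustrative sketch (not from the slides) of Laplace smoothing for estimating P(X = v_i) from counts; using a pseudo-count of 1 per value is just one common choice, and the function name is my own:

```python
from collections import Counter

def smoothed_probabilities(data, values, pseudo_count=1.0):
    """Estimate P(X = v) with pseudo-counts added to the observed counts."""
    counts = Counter(data)
    total = sum(counts[v] + pseudo_count for v in values)
    return {v: (counts[v] + pseudo_count) / total for v in values}

# Observed trip lengths; lengths 3, 4, and 5 never occur but still get
# non-zero probability, so the test-set log likelihood stays finite.
data = [1, 6, 6, 2, 1]
print(smoothed_probabilities(data, values=range(1, 7)))
```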