Machine Learning. Theory of Classification and Nonparametric Classifiers. Lecture 2, January 16. What is theoretically the best classifier?


Machine Learning 10-701/15-781, Spring 2008
Theory of Classification and Nonparametric Classifier
Eric Xing
Lecture 2, January 16
Reading: Chap. 2, 5 of CB and handouts

Outline
- What is theoretically the best classifier
- Bayesian decision rule for minimum error
- Nonparametric classifier (instance-based learning)
- Nonparametric density estimation
- K-nearest-neighbor classifier
- Optimality of kNN

Classification
- Representing data
- Hypothesis (classifier)
- Decision-making as dividing a high-dimensional space
(Figure: distributions of samples from a normal and an abnormal machine)

Basic Probability Concepts
A sample space S is the set of all possible outcomes of a conceptual or physical, repeatable experiment. (S can be finite or infinite.)
- E.g., S may be the set of all possible outcomes of a dice roll: S = {1, 2, 3, 4, 5, 6}
- E.g., S may be the set of all possible nucleotides of a DNA site: S = {A, T, C, G}
- E.g., S may be the set of all possible time-space positions of an aircraft on a radar screen: S = [0, R_max] x [0, 360) x [0, +inf)

Random Variable
A random variable is a function X(ω) that associates a unique numerical value (a token) with every outcome ω in S of an experiment. (The value of the r.v. will vary from trial to trial as the experiment is repeated.)
- Discrete r.v.:
  - The outcome of a dice roll
  - The outcome of reading a nucleotide at site i
- Binary event and indicator variable:
  - Seeing an "A" at a site: X = 1, o/w X = 0. This describes the true or false outcome of a random event.
  - Can we describe richer outcomes in the same way? (i.e., X = 1, 2, 3, 4 for being A, C, G, T) --- think about what would happen if we take the expectation of X.
- Unit-base random vector: X_i = [X_i^A, X_i^T, X_i^G, X_i^C]^T, e.g., X_i = [0, 0, 1, 0]^T means seeing a "G" at site i
- Continuous r.v.:
  - The outcome of recording the true location of an aircraft: X^true
  - The outcome of observing the measured location of an aircraft: X^obs

Discrete Prob. Distribution
(In the discrete case) a probability distribution P on S (and hence on the domain of X) is an assignment of a non-negative real number P(s) to each s in S (or each valid value of x) such that Σ_{s in S} P(s) = 1 (0 ≤ P(s) ≤ 1).
- Intuitively, P(s) corresponds to the frequency (or the likelihood) of getting s in the experiments, if repeated many times.
- Call θ_s = P(s) the parameters in a discrete probability distribution.
A probability distribution on a sample space is sometimes called a probability model, in particular if several different distributions are under consideration.
- Write models as M_1, M_2, and probabilities as P(X|M_1), P(X|M_2).
- E.g., M_1 may be the appropriate prob. dist. if X is from a "fair dice", M_2 is for the "loaded dice". M is usually a two-tuple of {dist. family, dist. parameters}.

Discrete Distributions
Bernoulli distribution: Ber(p)
  P(x) = 1 - p for x = 0, and p for x = 1; equivalently P(x) = p^x (1 - p)^{1 - x}
Multinomial distribution: Mult(1, θ)
- Multinomial (indicator) variable: X = [X_1, ..., X_6]^T, where each X_j is 0 or 1, X_j = 1 w.p. θ_j, Σ_{j in [1,...,6]} X_j = 1, and Σ_{j in [1,...,6]} θ_j = 1.
  p(x(j)) = P({X_j = 1, where j indexes the dice face}) = θ_j
  p(x) = θ_A^{x_A} θ_C^{x_C} θ_G^{x_G} θ_T^{x_T} = Π_k θ_k^{x_k}

Discrete Distributions
Multinomial distribution: Mult(n, θ)
- Count variable: x = [x_1, ..., x_K]^T, where Σ_j x_j = n
  p(x) = n! / (x_1! x_2! ... x_K!) θ_1^{x_1} θ_2^{x_2} ... θ_K^{x_K}

Continuous Prob. Distribution
A continuous random variable X can assume any value in an interval on the real line or in a region in a high-dimensional space.
- A random vector X = [x_1, x_2, ..., x_n]^T usually corresponds to real-valued measurements of some property, e.g., length, position, ...
- It is not possible to talk about the probability of the random variable assuming a particular value --- P(x) = 0.
- Instead, we talk about the probability of the random variable assuming a value within a given interval or half-interval, e.g., P(X in [x_1, x_2]), P(X < x_1), or an arbitrary Boolean combination of such basic propositions.
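
As a quick illustration, here is a minimal Python sketch (not from the lecture; the parameter values are invented) that evaluates the Bernoulli and multinomial pmfs defined above.

```python
import math

def bernoulli_pmf(x, p):
    """P(x) = p^x (1-p)^(1-x) for x in {0, 1}."""
    return p**x * (1 - p)**(1 - x)

def multinomial_pmf(counts, theta):
    """p(x) = n!/(x_1!...x_K!) * prod_k theta_k^{x_k}, with n = sum of counts."""
    n = sum(counts)
    coef = math.factorial(n)
    for c in counts:
        coef //= math.factorial(c)
    prob = coef
    for c, t in zip(counts, theta):
        prob *= t**c
    return prob

# Example: a fair coin, and the count vector from 10 rolls of a fair six-sided die
print(bernoulli_pmf(1, p=0.5))                       # 0.5
print(multinomial_pmf([2, 1, 2, 2, 1, 2], [1/6]*6))  # probability of this count vector
```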

Continuous Prob. Distribution
The probability of the random variable assuming a value within some given interval from x_1 to x_2 is defined to be the area under the graph of the probability density function between x_1 and x_2.
- Probability mass: P(X in [x_1, x_2]) = ∫_{x_1}^{x_2} p(x) dx; note that ∫_{-inf}^{+inf} p(x) dx = 1.
- Cumulative distribution function (CDF): P(x) = P(X < x) = ∫_{-inf}^{x} p(x') dx'
- Probability density function (PDF): p(x) = d/dx P(x); ∫_{-inf}^{+inf} p(x) dx = 1; p(x) ≥ 0 for all x
(Figure: car flow on Liberty Bridge, cooked up!)

Continuous Distributions
Uniform probability density function:
  p(x) = 1/(b - a) for a ≤ x ≤ b, and 0 elsewhere
Normal (Gaussian) probability density function:
  p(x) = 1/(sqrt(2π) σ) exp(-(x - µ)^2 / 2σ^2)
- The distribution is symmetric, and is often illustrated as a bell-shaped curve.
- Two parameters, µ (mean) and σ (standard deviation), determine the location and shape of the distribution.
- The highest point on the normal curve is at the mean, which is also the median and mode.
- The mean can be any numerical value: negative, zero, or positive.
Multivariate Gaussian:
  p(X; µ, Σ) = 1 / ((2π)^{n/2} |Σ|^{1/2}) exp(-(1/2) (X - µ)^T Σ^{-1} (X - µ))
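
The multivariate Gaussian density above is easy to evaluate directly. The following NumPy sketch (an illustration, not lecture code; the function name is invented) computes it for an arbitrary mean vector and covariance matrix.

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """p(x; mu, Sigma) = (2*pi)^(-n/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu))."""
    x, mu = np.asarray(x, dtype=float), np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    n = x.shape[0]
    diff = x - mu
    norm_const = (2 * np.pi) ** (-n / 2) * np.linalg.det(Sigma) ** (-0.5)
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x-mu)^T Sigma^{-1} (x-mu)
    return norm_const * np.exp(-0.5 * quad)

# Example: standard bivariate Gaussian evaluated at the origin -> 1/(2*pi) ~ 0.159
print(gaussian_density([0.0, 0.0], mu=[0.0, 0.0], Sigma=np.eye(2)))
```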

Class-Conditional Probability
- Class-specific distribution: P(X|Y)
  p(X|Y = 1) = p_1(X; µ_1, Σ_1)
  p(X|Y = 2) = p_2(X; µ_2, Σ_2)
- Class prior (i.e., "weight"): P(Y)

The Bayes Rule
What we have just done leads to the following general expression:
  P(Y|X) = P(X|Y) p(Y) / P(X)
This is Bayes rule.

The Bayes Decision Rule for Minimum Error
The a posteriori probability of a sample:
  q_i(X) = P(Y = i|X) = p(X|Y = i) P(Y = i) / p(X) = π_i p_i(X) / Σ_j π_j p_j(X)
Bayes test: decide class 1 if q_1(X) > q_2(X), and class 2 otherwise.
Likelihood ratio: l(X) = p_1(X) / p_2(X), compared against the threshold π_2 / π_1.
Discriminant function: h(X) = -log l(X), compared against log(π_1 / π_2).

Example of Decision Rules
- When each class is a normal distribution, we can write the decision boundary analytically in some cases (homework!!).
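
To make the rule concrete, here is a small Python sketch (an illustration with made-up priors and Gaussian parameters, not the course's code) of the Bayes decision rule for two Gaussian class-conditional densities.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density p(x; mu, sigma)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def bayes_classify(x, priors, params):
    """Return argmax_i pi_i * p_i(x); the normalizer p(x) is common to all classes."""
    scores = [pi * gaussian_pdf(x, mu, sigma) for pi, (mu, sigma) in zip(priors, params)]
    return int(np.argmax(scores))

# Two classes with equal priors: class 0 ~ N(-1, 1), class 1 ~ N(+2, 1)
priors = [0.5, 0.5]
params = [(-1.0, 1.0), (2.0, 1.0)]
for x in [-2.0, 0.4, 0.6, 3.0]:
    print(x, "->", bayes_classify(x, priors, params))
# Decision boundary is at x = 0.5 (midpoint of the means, since priors and variances are equal)
```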

Bayes Error
We must calculate the probability of error, i.e., the probability that a sample is assigned to the wrong class.
- Given a datum X, the (conditional) risk is r(X) = min[q_1(X), q_2(X)].
- The Bayes error (the expected risk): ε* = E[r(X)] = ∫ min[π_1 p_1(X), π_2 p_2(X)] dX

More on Bayes Error
- The Bayes error is the lower bound on the probability of classification error.
- The Bayes classifier is the theoretically best classifier, the one that minimizes the probability of classification error.
- Computing the Bayes error is in general a very complex problem. Why?
  - Density estimation
  - Integrating the density function
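
Since the integral above rarely has a closed form, one common workaround is Monte Carlo: sample X from the mixture and average min_i q_i(X). The sketch below (illustrative only; it reuses the invented two-Gaussian toy problem from the previous snippet) estimates the Bayes error this way.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

# Two equally likely classes: N(-1, 1) and N(+2, 1)
priors, params = np.array([0.5, 0.5]), [(-1.0, 1.0), (2.0, 1.0)]

# Draw samples from the mixture p(x) = sum_i pi_i p_i(x)
labels = rng.choice(2, size=100_000, p=priors)
xs = np.array([rng.normal(*params[y]) for y in labels])

# Bayes error = E_x[ min_i q_i(x) ], where q_i(x) = pi_i p_i(x) / p(x)
joint = np.stack([pi * gaussian_pdf(xs, mu, s) for pi, (mu, s) in zip(priors, params)])
posteriors = joint / joint.sum(axis=0)
print("Monte Carlo estimate of Bayes error:", posteriors.min(axis=0).mean())
# For these parameters the exact value is Phi(-1.5) ~ 0.0668
```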

Learning Classifier
The decision rule: ŷ = h(X), e.g., assign X to the class with the largest estimated posterior.
Learning strategies:
- Generative learning
  - Parametric
  - Nonparametric
- Discriminative learning
- Instance-based (store all past experience in memory): a special case of nonparametric classifiers

Supervised Learning
K-nearest-neighbor classifier: h(x) is represented by all the data, and by an algorithm.

Recall: Vector Space Representation
Each document is a vector, one component for each term (= word).

            Doc 1   Doc 2   Doc 3   ...
  Word 1      3       0       0     ...
  Word 2      0       8       1     ...
  Word 3     12       1      10     ...
  ...         0       1       3     ...
  ...         0       0       0     ...

Normalize to unit length.
High-dimensional vector space:
- Terms are axes: 10,000+ dimensions, or even 100,000+
- Docs are vectors in this space

Classes in a Vector Space
(Figure: documents in a vector space, grouped into Sports, Science, and Arts classes)
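
A minimal sketch of this representation (illustrative only; the vocabulary, documents, and function name are invented) that builds term-count vectors and normalizes them to unit length:

```python
import numpy as np

def term_count_vectors(docs, vocab):
    """Build one count vector per document over a fixed vocabulary."""
    index = {word: j for j, word in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        for word in doc.lower().split():
            if word in index:
                X[i, index[word]] += 1
    return X

docs = ["the team won the game", "the experiment confirmed the theory"]
vocab = ["team", "won", "game", "experiment", "theory"]

X = term_count_vectors(docs, vocab)
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)  # normalize each doc to unit length
print(X_unit)
```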

Test Document = ?
(Figure: an unlabeled test document among the Sports, Science, and Arts classes)

K-Nearest Neighbor (kNN) Classifier
(Figure: the test document is labeled according to its nearest neighbors among Sports, Science, and Arts)

kNN Is Close to Optimal
- Cover and Hart 1967: asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes rate [the error rate of a classifier that knows the model that generated the data].
- In particular, the asymptotic error rate is 0 if the Bayes rate is 0.
- Decision boundary: for 1-NN, the piecewise-linear boundary induced by the Voronoi partition of the training samples.

Where Does kNN Come From?
How to estimate p(X)? Nonparametric density estimation.
- Parzen density estimate: fix a small region R of volume V around X and count the number k of the n samples that fall in R: p̂(X) ≈ k / (nV).
  - E.g., R is a hypercube (a uniform window) centered at X.
  - More generally, replace the hard window with a kernel function.
- kNN density estimate: fix k instead, and grow the volume V_k(X) until it contains the k nearest samples: p̂(X) = k / (n V_k(X)).
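
A rough sketch of the two estimators in one dimension (illustrative; the bandwidth h and the value of k are invented, and both estimators follow the p̂(X) = k/(nV) idea described above):

```python
import numpy as np

def parzen_estimate(x, samples, h):
    """Fix the window volume V = 2h and count samples inside: p_hat = k / (n * V)."""
    samples = np.asarray(samples)
    k = np.sum(np.abs(samples - x) <= h)
    return k / (len(samples) * 2 * h)

def knn_density_estimate(x, samples, k):
    """Fix k and grow the interval until it holds k samples: p_hat = k / (n * V_k)."""
    samples = np.asarray(samples)
    r = np.sort(np.abs(samples - x))[k - 1]   # distance to the k-th nearest sample
    return k / (len(samples) * 2 * r)

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=2000)   # samples from a standard Gaussian
# True density at x = 0 is 1/sqrt(2*pi) ~ 0.399
print("Parzen:", parzen_estimate(0.0, data, h=0.2))
print("kNN:   ", knn_density_estimate(0.0, data, k=50))
```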

Voting kNN
The procedure: (Figure: each of the k nearest neighbors casts a vote among Sports, Science, and Arts)

Asymptotic Analysis
- Conditional risk: r_k(X, X_NN), for a test sample X and its nearest-neighbor sample X_NN.
- Denote the event that X is of class i as X_i.
- Assuming k = 1:
- When an infinite number of samples is available, X_NN will be arbitrarily close to X.

Asymptotic Analysis, cont.
- Recall the conditional Bayes risk: r*(X) = min[q_1(X), q_2(X)].
- Thus the asymptotic conditional risk of 1-NN is r(X) = 2 q_1(X) q_2(X) = 2 r*(X)(1 - r*(X)) (obtained via a Maclaurin series expansion).
- It can be shown that ε* ≤ ε_1NN ≤ 2ε*(1 - ε*): the asymptotic 1-NN error is at most twice the Bayes error.
- This is remarkable, considering that the procedure does not use any information about the underlying distributions and only the class of the single nearest neighbor determines the outcome of the decision.
- In fact, for M classes, ε* ≤ ε_NN ≤ ε*(2 - M ε*/(M - 1)).
- Example:
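
As a sanity check on the Cover-Hart bound, the following sketch (illustrative only; it reuses the invented two-Gaussian toy problem, whose Bayes error is about 0.067) estimates the 1-NN error rate empirically and prints it next to 2ε*(1 - ε*).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Toy problem: class 0 ~ N(-1, 1), class 1 ~ N(+2, 1), equal priors."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(np.where(y == 0, -1.0, 2.0), 1.0)
    return x, y

def one_nn_predict(x_train, y_train, x_test):
    """Label each test point with the class of its single nearest training point."""
    idx = np.abs(x_test[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[idx]

x_tr, y_tr = sample(2000)
x_te, y_te = sample(2000)
err_1nn = np.mean(one_nn_predict(x_tr, y_tr, x_te) != y_te)

bayes = 0.0668  # Phi(-1.5) for this toy problem
print(f"1-NN error ~ {err_1nn:.3f}, Bayes error ~ {bayes:.3f}, bound 2e*(1-e*) ~ {2*bayes*(1-bayes):.3f}")
```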

kNN Is an Instance of Instance-Based Learning
What makes an instance-based learner?
- A distance metric
- How many nearby neighbors to look at?
- A weighting function (optional)
- How to relate to the local points?

Euclidean Distance Metric
  D(x, x') = sqrt(Σ_i σ_i^2 (x_i - x'_i)^2)
Or equivalently,
  D(x, x') = sqrt((x - x')^T Σ (x - x'))
Other metrics:
- L_1 norm: Σ_i |x_i - x'_i|
- L_inf norm: max_i |x_i - x'_i| (elementwise)
- Mahalanobis: where Σ is full, and symmetric
- Correlation, angle
- Hamming distance, Manhattan distance
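
A few of these metrics as plain NumPy functions (a minimal sketch; the function names and the per-feature weights sigma are invented, and Mahalanobis is written in its standard form with the inverse covariance):

```python
import numpy as np

def weighted_euclidean(x, y, sigma):
    """D(x, y) = sqrt(sum_i sigma_i^2 (x_i - y_i)^2)."""
    x, y, sigma = map(np.asarray, (x, y, sigma))
    return np.sqrt(np.sum((sigma * (x - y)) ** 2))

def l1_distance(x, y):
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)))

def linf_distance(x, y):
    return np.max(np.abs(np.asarray(x) - np.asarray(y)))

def mahalanobis(x, y, Sigma):
    """D(x, y) = sqrt((x - y)^T Sigma^{-1} (x - y)), with Sigma full and symmetric."""
    d = np.asarray(x) - np.asarray(y)
    return np.sqrt(d @ np.linalg.solve(Sigma, d))

x, y = [1.0, 2.0], [3.0, -1.0]
print(weighted_euclidean(x, y, sigma=[1.0, 0.5]))
print(l1_distance(x, y), linf_distance(x, y))
print(mahalanobis(x, y, Sigma=np.array([[2.0, 0.3], [0.3, 1.0]])))
```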

1-Nearest Neighbor (kNN) Classifier
(Figure: classification of the test document by its single nearest neighbor among Sports, Science, and Arts)

2-Nearest Neighbor (kNN) Classifier
(Figure: classification by the two nearest neighbors)

3-Nearest Neighbor (kNN) Classifier
(Figure: classification by the three nearest neighbors among Sports, Science, and Arts)

5-Nearest Neighbor (kNN) Classifier
(Figure: classification by the five nearest neighbors)

Nearest-Neighbor Learning Algorithm
Learning is just storing the representations of the training examples in D.
Testing instance x (see the sketch below):
- Compute the similarity between x and all examples in D.
- Assign x the category of the most similar example in D.
Does not explicitly compute a generalization or category prototypes.
Also called:
- Case-based learning
- Memory-based learning
- Lazy learning
Is kNN ideal? More later.
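
A compact voting-kNN implementation of this procedure (an illustrative sketch using Euclidean distance and a majority vote, with an invented toy training set; not code from the course):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """Assign x the majority label among its k nearest training examples (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D training set with two classes
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2], [1.1, 0.8]])
y_train = np.array(["Arts", "Arts", "Sports", "Sports", "Sports"])

print(knn_classify(np.array([0.1, 0.0]), X_train, y_train, k=3))  # -> "Arts"
print(knn_classify(np.array([1.0, 1.0]), X_train, y_train, k=3))  # -> "Sports"
```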

Summary
- The Bayes classifier is the best classifier, which minimizes the probability of classification error.
- Nonparametric and parametric classifiers: a nonparametric classifier does not rely on any assumption concerning the structure of the underlying density function.
- A classifier becomes the Bayes classifier if the density estimates converge to the true densities when an infinite number of samples are used.
- The resulting error is the Bayes error, the smallest achievable error given the underlying distributions.