Lecture 2: Classification I
Sam Roweis
September 21, 2004

Reminder: Classification
Multiple inputs x (can be continuous, discrete or both). Single discrete output y. Goal: predict the output on future unseen inputs. Issues: model complexity, overfitting. From a probabilistic point of view, we are using Bayes rule:
$p(y \mid x) = \frac{p(x \mid y)\,p(y)}{p(x)} = \frac{p(x \mid y)\,p(y)}{\sum_{y'} p(x \mid y')\,p(y')}$

Voronoi Tessellation, Decision Surfaces
For continuous inputs, we can view the problem as one of segmenting the input space into regions which belong to a single class, i.e. constant output. Such a segmentation is the Voronoi tessellation for our classifier. The boundaries between the regions are the decision surfaces. Training a classifier == defining decision surfaces.

Probabilistic Model, Bayes Error Rate
Model the original data as coming from a joint pdf p(x, y). Classification == trying to learn the conditional density p(y | x). Even if we get the perfect model, our error rate may not be zero. Why? Classes may overlap. The best we could ever do, if our cost function is the number of errors, is to guess $\hat{y} = \arg\max_y p(y \mid x)$. (The error rate of this procedure is known as the Bayes error.)
[Figure: overlapping class-conditional densities p(x|c) and the corresponding posteriors p(c|x) along x, with the Bayes error region shaded.]
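These ideas are easy to check numerically. Below is a minimal Python/NumPy sketch (my own illustration, not from the lecture) that applies Bayes rule to two made-up 1-D Gaussian classes and estimates the resulting Bayes error on a grid; the means, variance and priors are arbitrary choices.

import numpy as np

def gauss_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2) evaluated at x.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Assumed class-conditional densities p(x|y) and priors p(y) for a toy problem.
priors = np.array([0.6, 0.4])               # p(y=0), p(y=1)
mus, sigma = np.array([-1.0, 1.0]), 1.0

def posterior(x):
    # Bayes rule: p(y|x) = p(x|y) p(y) / sum_y' p(x|y') p(y').
    joint = np.array([gauss_pdf(x, mu, sigma) * p for mu, p in zip(mus, priors)])
    return joint / joint.sum(axis=0)

def bayes_classify(x):
    # The best possible guess under 0/1 loss: argmax_y p(y|x).
    return np.argmax(posterior(x), axis=0)

# Bayes error = E_x[1 - max_y p(y|x)], approximated on a fine grid.
xs = np.linspace(-8.0, 8.0, 10001)
p_x = sum(gauss_pdf(xs, mu, sigma) * p for mu, p in zip(mus, priors))
dx = xs[1] - xs[0]
bayes_err = np.sum((1.0 - posterior(xs).max(axis=0)) * p_x) * dx
print("Bayes-optimal class at x=0.3:", bayes_classify(np.array([0.3]))[0])
print("Bayes error for this toy problem: %.4f" % bayes_err)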

Finally: a real algorithm! K-Nearest-Neighbour
To classify a test point, choose the most common class amongst its K nearest neighbours in the training set.

Algorithm K-NN:
c-test = KNN(K, x-train, c-train, x-test) {
  d(m, n)   = distance between x-train(m) and x-test(n)
  nn(n, l)  = index of the l-th smallest entry of d(:, n)   [*]
  c(n, l)   = c-train(nn(n, l))
  c-test(n) = most common value in c(n, 1:K)                [**]
}
If there are ties at [*] when l = K, increase K for that n only. If there are ties at [**], decrease K for that n only. Confidence = (#votes for the winning class) / K.

More on K-NN
Typical distance = squared Euclidean: $d(m, n) = \sum_d (x_{md} - x_{nd})^2$.
If Euclidean distance is used, the decision surfaces are piecewise linear.
Trick: remember the K-th smallest distance so far, and break out of the summation over dimensions if you exceed it.
In low dimensions with lots of training points you can build KD-trees, ball trees or other data structures to speed up the query time. In high dimensions, save time by computing the distance of each training point from the min corner and using the annulus bound.

Error Bounds for NN
Amazing fact: asymptotically, err(1-NN) < 2 err(Bayes). For M classes,
$e^{*} \le e_{1\mathrm{NN}} \le 2e^{*} - \frac{M}{M-1}(e^{*})^{2}$.
This is a tight upper bound, achieved in the zero-information case when the classes have identical densities. For K-NN there are also bounds, e.g. for two classes and odd K:
$e^{*} \le e_{K\mathrm{NN}} \le \sum_{i=0}^{(K-1)/2} \binom{K}{i} \left[ (e^{*})^{i+1}(1-e^{*})^{K-i} + (e^{*})^{K-i}(1-e^{*})^{i+1} \right]$.

Example: USPS Digits
Take 16x16 grayscale images (8 bit) of handwritten digits. Use Euclidean distance in raw pixel space (dumb!) and 7-NN. Classification error (leave-one-out): 4.85%.
[Figure: a query digit x_q and its 7 nearest neighbours.]
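For concreteness, here is a short Python/NumPy sketch of the K-NN rule and the leave-one-out evaluation used in the digits example. It is my own illustration on invented 2-D data, not the USPS experiment, and it breaks ties by the lowest class label rather than by adjusting K as the slide suggests.

import numpy as np

def knn_classify(K, x_train, c_train, x_test):
    # Predicted class for each row of x_test, by majority vote among the
    # K training points with the smallest squared Euclidean distance.
    d = ((x_train[:, None, :] - x_test[None, :, :]) ** 2).sum(axis=2)
    preds = np.empty(len(x_test), dtype=c_train.dtype)
    for n in range(x_test.shape[0]):
        nearest = np.argsort(d[:, n])[:K]        # indices of the K nearest training points
        votes = np.bincount(c_train[nearest])    # votes per class label
        preds[n] = votes.argmax()                # most common class (ties: lowest label)
    return preds

def loo_error(K, x, c):
    # Leave-one-out error of K-NN on a labelled set, as in the digits example.
    wrong = 0
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        wrong += int(knn_classify(K, x[keep], c[keep], x[i:i + 1])[0] != c[i])
    return wrong / len(x)

# Toy usage: two Gaussian blobs in 2-D (invented data).
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(+1.0, 1.0, (50, 2))])
c = np.array([0] * 50 + [1] * 50)
print("7-NN leave-one-out error:", loo_error(7, x, c))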

Nonparametric (Instance-Based) Models
Q: What are the parameters in K-NN? What is the complexity?
A: The scalar K and the entire training set.
Models which need the entire training set at test time but (hopefully) have very few other parameters are known as nonparametric, instance-based or case-based.
What if we want a classifier that uses only a small number of parameters at test time (e.g. for speed or memory reasons)?
Idea 1: a single linear boundary, of arbitrary orientation.
Idea 2: many boundaries, but axis-parallel and tree structured.
[Figure: left, a single oblique linear boundary in the (x1, x2) plane; right, an axis-parallel, tree-structured partition with thresholds t1..t5.]

Linear Classification for Binary Output
Goal: find the line (or hyperplane) which best separates two classes:
$c(x) = \mathrm{sign}[\,x^{\top} w - w_0\,]$,
where $w$ is the weight vector and $w_0$ the threshold. $w$ is a vector perpendicular to the decision boundary. This is the opposite of nonparametric: only $d + 1$ parameters! Typically we augment x with a constant term $\pm 1$ (a "bias unit") and then absorb $w_0$ into $w$, so we don't have to treat it specially.

Fisher's Linear Discriminant
Observation: if each class has a Gaussian distribution with the same covariance, then the Bayes decision boundary is linear:
$w = \Sigma^{-1}(\mu_0 - \mu_1)$
$w_0 = \frac{1}{2}\, w^{\top}(\mu_0 + \mu_1) - \frac{w^{\top}(\mu_0 - \mu_1)}{(\mu_0 - \mu_1)^{\top}\Sigma^{-1}(\mu_0 - \mu_1)}\,(\log p_0 - \log p_1)$
(with this $w$ the ratio in the second term equals one, so $w_0 = \frac{1}{2} w^{\top}(\mu_0 + \mu_1) - \log p_0 + \log p_1$).
Idea (Fisher '36): assume each class is Gaussian even if they aren't! Fit $\mu_i$ and $\Sigma$ as the sample mean and (shared) sample covariance. This also maximizes the ratio of cross-class scatter to within-class scatter,
$(\bar{z}_0 - \bar{z}_1)^2 / (\mathrm{var}(z_0) + \mathrm{var}(z_1))$,
where $z = w^{\top} x$ is the projection onto $w$.

Digits again
Train to discriminate 5 from the other digits. Error = 3.59%.
[Figure: Fisher discriminant for 5 vs not-5, with threshold w_0.]
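The fitting recipe above (sample means, shared sample covariance, prior-dependent threshold) is only a few lines of code. The sketch below is an assumed illustration on toy 2-D data using the same convention c(x) = sign(x.w - w0) as the slides; it is not the code behind the digits experiment.

import numpy as np

def fit_fisher(x, y):
    # x: (N, d) inputs; y: (N,) labels in {0, 1}. Returns (w, w0).
    x0, x1 = x[y == 0], x[y == 1]
    mu0, mu1 = x0.mean(axis=0), x1.mean(axis=0)
    # Shared ("pooled") covariance estimate.
    sigma = (np.cov(x0, rowvar=False) * (len(x0) - 1) +
             np.cov(x1, rowvar=False) * (len(x1) - 1)) / (len(x) - 2)
    w = np.linalg.solve(sigma, mu0 - mu1)          # w = Sigma^{-1}(mu0 - mu1)
    p0, p1 = len(x0) / len(x), len(x1) / len(x)    # class priors from frequencies
    w0 = 0.5 * w @ (mu0 + mu1) - np.log(p0) + np.log(p1)
    return w, w0

def predict(w, w0, x):
    # Class 0 where x.w - w0 > 0, class 1 otherwise.
    return np.where(x @ w - w0 > 0, 0, 1)

# Toy usage: two Gaussian blobs with a shared covariance.
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(-1.0, 1.0, (200, 2)), rng.normal(+1.0, 1.0, (200, 2))])
y = np.array([0] * 200 + [1] * 200)
w, w0 = fit_fisher(x, y)
print("training error:", np.mean(predict(w, w0, x) != y))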

Linear Discriminants are Perceptrons
The architecture we are using, $c(x) = \mathrm{sign}[\,x^{\top} w - w_0\,]$, can be thought of as a circuit/network. It was studied extensively in the 1960s and is known as a perceptron. There is another way to train the weights, other than Fisher's.
[Figure: the perceptron drawn as a network, with inputs x_i connected through weights w to the output y.]

Algorithm perceptrontrain (Rosenblatt '56):
w = perceptrontrain(x-train, c-train) {
  w = small random values
  do {
    errors = 0
    for n = 1:N {
      if (c-train(n) != sign[w' * x-train(n)]) then {
        w = w + c-train(n) * x-train(n)
        errors++
      }
    }
  } until (errors == 0)
}
(A runnable version of this loop is given at the end of this section.)

Perceptron Learning Rules
Now: cycle through the examples; when you make an error, add or subtract the example from the weight vector depending on its true class. Amazingly, for separable training sets this always converges. (We absorb the threshold as a bias variable always equal to -1.)
For non-separable datasets, you need to remember the sets of weights which you have seen so far and combine them somehow. One way: keep the set that survived unchanged for the longest number of (random) pattern presentations (Gallant's pocket algorithm). A better way: Freund and Schapire's voted perceptron algorithm, which remembers all the weight sets and the length of time each survived.
Perceptron, voted perceptron, weighted majority, kernel perceptron, Winnow, and other such algorithms have a frumpy reputation but they are actually extremely powerful and useful, especially using the kernel trick. Try these before more complex classifiers such as SVMs!

Tree Structured Axis-Aligned Classifiers
What if we want more than two regions? We could consider a fixed number of arbitrary linear segments, but even cheaper is to use axis-aligned splits. If these form a hierarchical partition, then the classifier is called a decision tree or classification tree. Each internal node tests one attribute; leaves assign a class. This is equivalent to a disjunction of conjunctions of constraints on attribute values (if-then rules).
[Figure: an axis-aligned partition of the (x1, x2) plane with thresholds t1..t5 and the corresponding tree of threshold tests.]

Cost Function for Decision Trees
Define a measure of class impurity in a set of examples. Push each example down the tree: how pure are the leaves? Goal: minimize the expected sum of impurity at the leaves at test time. Two problems: 1) we don't know the true distribution p(x, y); 2) search: even if we knew p(x, y), finding the optimal tree is NP-hard. So we will take a suboptimal (greedy) approach.
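Returning to the perceptron algorithm above, here is a runnable Python/NumPy version of that training loop. Labels are in {-1, +1}, the threshold is absorbed as a constant bias input of -1, and the cap on the number of passes is an added safeguard for non-separable data (the loop as written on the slide would not terminate in that case).

import numpy as np

def perceptron_train(x_train, c_train, max_passes=1000, seed=0):
    # x_train: (N, d); c_train: (N,) labels in {-1, +1}. Returns the weight vector w.
    rng = np.random.default_rng(seed)
    # Append the constant bias input of -1 so the threshold is just another weight.
    x = np.hstack([x_train, -np.ones((len(x_train), 1))])
    w = rng.normal(scale=0.01, size=x.shape[1])     # small random initial weights
    for _ in range(max_passes):
        errors = 0
        for xn, cn in zip(x, c_train):
            if np.sign(w @ xn) != cn:               # misclassified:
                w = w + cn * xn                     # add/subtract the example
                errors += 1
        if errors == 0:                             # converged (separable case)
            break
    return w

# Toy usage: a (almost surely) linearly separable problem.
rng = np.random.default_rng(2)
x = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(+2.0, 1.0, (50, 2))])
c = np.array([-1] * 50 + [+1] * 50)
w = perceptron_train(x, c)
pred = np.sign(np.hstack([x, -np.ones((len(x), 1))]) @ w)
print("training error:", np.mean(pred != c))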

Learning (Inducing) Decision Trees
Need to pick the order of the split axes and the values of the split points. Many algorithms: CART, ID3, C4.5, C5.0. Almost all have the following structure:
1. Put all examples into the root node.
2. At each node, search all dimensions; on each one choose the split which most reduces impurity, then take the best of these splits.
3. Sort the data cases into the daughter nodes based on the split.
4. Recurse until a leaf condition:
   - the number of examples at the node is too small
   - all examples at the node have the same class
   - all examples at the node have the same inputs
5. Prune the tree down to some maximum number of leaves (possibly using a different impurity measure than for growing).

Impurity Measures
When considering splitting the data D at a node on attribute $x_i$, we measure the gain
$\mathrm{Gain}(D; x_i) = I(D) - \sum_{v \in \mathrm{split}(x_i)} \frac{|D_{iv}|}{|D|}\, I(D_{iv})$,
where $D_{iv}$ is the subset of D with $x_i = v$. Common impurity measures:
Entropy: $I(D) = -\sum_c p_c(D) \log p_c(D)$
Misclassification: $I(D) = 1 - \max_c p_c(D)$
Gini: $I(D) = \sum_{c \ne c'} p_c(D)\,p_{c'}(D) = \sum_c p_c(D)\,(1 - p_c(D))$
(Gini is the average error if we stochastically classify using the node's class prior.)
These often favour multi-way splits. One solution: normalize the gain by the split information
$S(D) = -\sum_v \frac{|D_{iv}|}{|D|} \log \frac{|D_{iv}|}{|D|}$.
[Figure: entropy, Gini index and misclassification error as functions of the class probability p, for two classes.]

Restrict to Binary Splits
A better solution is to always constrain ourselves to binary splits. For ordered discrete or real-valued attributes the split is natural and also easy to compute. For a discrete attribute with M settings it looks like we need to consider $2^{M-1}$ splits, but for two classes there is a trick:
1. Order the settings according to $p(c \mid x_i = m)$.
2. Search exhaustively over q, grouping the first q and the last M - q settings.
3. The optimal split is one of these.

Real Valued Attributes
For real-valued attributes, what splits should we consider?
Idea 1: discretize the real value into M bins.
Idea 2: search for a scalar value to split on. Sounds hard! Lots of real values. But there is a trick: we only need to consider splits at midpoints between observed values; in fact, only at midpoints between observed values with different classes. Complexity: $N \log N + 2N$.
[Figure: sorted values of a real attribute on a line; candidate splits are midpoints between neighbouring values with different classes.]
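A small Python/NumPy sketch of these two ideas, entropy/Gini impurity and the search over midpoints for a real-valued attribute, is given below. It is my own illustration; for simplicity it checks all midpoints between distinct consecutive values rather than only those where the class changes.

import numpy as np

def entropy(labels):
    # I(D) = - sum_c p_c log2 p_c over the class labels in D.
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def gini(labels):
    # I(D) = sum_c p_c (1 - p_c).
    p = np.bincount(labels) / len(labels)
    return (p * (1 - p)).sum()

def best_threshold(x, labels, impurity=entropy):
    # Best split point for one real-valued attribute: search midpoints of sorted values.
    order = np.argsort(x)
    xs, ys = x[order], labels[order]
    base, n = impurity(ys), len(ys)
    best_gain, best_t = 0.0, None
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue                        # no split possible between equal values
        t = 0.5 * (xs[i] + xs[i - 1])       # midpoint between consecutive values
        left, right = ys[:i], ys[i:]
        gain = base - (len(left) / n) * impurity(left) - (len(right) / n) * impurity(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

# Toy usage: a single attribute where the class flips around x = 0.
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 200)
labels = (x + rng.normal(0, 0.1, 200) > 0).astype(int)
print("entropy gain:", best_threshold(x, labels))
print("Gini gain:   ", best_threshold(x, labels, impurity=gini))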

Algorithm: DT
root of decision tree = SplitNode(train-data, nmin)

subtree = SplitNode(D) {
  c = most common class in D
  if (all class(D) the same) or (all x(D) the same) or (size(D) < nmin) then
    return a leaf of class c
  else
    for each xi, measure Gain(D; xi)
    return a node which splits on the best xi and has daughters:
      - SplitNode(Div) for all split values v with nonempty Div
      - a leaf of class c for values with empty Div
}

G = Gain(D, i) {
  G = I(D)
  for each value v in split(xi) {
    Div = cases in D with xi = v
    G = G - I(Div) * size(Div) / size(D)
  }
}
(A runnable Python sketch of this recursion appears at the end of this section.)

Overfitting in Trees
Just as with most other models, decision trees can overfit. In fact they are quite powerful, e.g. consider the expressive power of binary trees. Q: If all inputs and outputs are binary, what class of Boolean functions can DTs represent? A: All Boolean functions. Hence we must regularize to control capacity. Typically we do this by limiting the number of leaf nodes. Formally, we define
$\Phi(T) = \sum_{l \in \mathrm{leaves}(T)} I(l) + \alpha\,|\mathrm{leaves}(T)|$.
Minimizing this for any $\alpha$ is equivalent to finding the tree of a fixed size with the smallest impurity (cf. Lagrange multipliers). Practically, we achieve this via pruning. Often we use Gini/entropy to grow the tree and misclassification error to prune it.

Pruning Decision Trees
Finding the optimal pruned tree: it can be shown that if you start with a tree $T_0$ and insist on using a rooted subtree of it, the following sequence of trees contains the optimum tree for every number of leaves:
1. Let U(node) = I(node) - I(subtree rooted at node).
2. Replace the non-leaf node with the smallest value of U(node) / (#leaves below node) by a leaf node having the majority class, and repeat.
Even after pruning, decision trees still have problems:
- they cannot capture additive structure (OR); for this MARS is better
- they cannot deal with linear combinations of variables

DT Variants
ID3 (Quinlan):
- split values are all possible values of $x_i$
- I(D) is entropy
- no pruning
C4.5, C5.0 (Quinlan):
- binary splits
- I(D) is entropy
- error-pruning
- rule simplification
CART (Breiman et al.):
- binary splits
- I(D) is Gini
- minimum-leaf subtree pruning
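As promised above, here is a compact, self-contained Python sketch of the SplitNode recursion: greedy growth with binary splits on real-valued attributes, entropy impurity, and the usual stopping conditions. It is an assumed illustration (no pruning step), not the lecture's own code.

import numpy as np

def entropy(y):
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def best_split(X, y):
    # Return (gain, feature, threshold) of the best binary split, or (0, None, None).
    base, n = entropy(y), len(y)
    best = (0.0, None, None)
    for j in range(X.shape[1]):
        order = np.argsort(X[:, j])
        xs, ys = X[order, j], y[order]
        for i in range(1, n):
            if xs[i] == xs[i - 1]:
                continue
            gain = base - (i / n) * entropy(ys[:i]) - ((n - i) / n) * entropy(ys[i:])
            if gain > best[0]:
                best = (gain, j, 0.5 * (xs[i] + xs[i - 1]))
    return best

def split_node(X, y, nmin=5):
    # Grow a tree: an internal node {'feat','thresh','left','right'} or a leaf {'class'}.
    majority = int(np.bincount(y).argmax())
    if len(y) < nmin or len(np.unique(y)) == 1:
        return {"class": majority}
    gain, j, t = best_split(X, y)
    if j is None:                                   # no useful split: make a leaf
        return {"class": majority}
    mask = X[:, j] < t
    return {"feat": j, "thresh": t,
            "left": split_node(X[mask], y[mask], nmin),
            "right": split_node(X[~mask], y[~mask], nmin)}

def predict_one(tree, x):
    while "class" not in tree:
        tree = tree["left"] if x[tree["feat"]] < tree["thresh"] else tree["right"]
    return tree["class"]

# Toy usage: an axis-aligned "checkerboard" problem that a depth-2 tree can solve.
rng = np.random.default_rng(4)
X = rng.uniform(0, 10, (300, 2))
y = ((X[:, 0] > 5) ^ (X[:, 1] > 5)).astype(int)
tree = split_node(X, y)
preds = np.array([predict_one(tree, x) for x in X])
print("training error:", np.mean(preds != y))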

Open questions...
How do we choose K in K-NN?
How do we choose the maximum tree size $T_{\max}$ for decision trees?
Can Fisher's discriminant overfit?
What about nearest-neighbour or tree-based regression models?
Still to come: logistic regression, neural networks for classification, class-conditional models (Gaussian and naive Bayes).