Linear Discriminant Function: The linear discriminant function is

g(x) = w^t x + ω_0,

where x is the point, w is the weight vector, ω_0 is the bias, and t denotes the transpose.

Two-Category Case: In the two-category case we have two classes (salmon and sea bass). We decide which class a point belongs to by evaluating the discriminant function and assigning the point to ω_1 if g(x) > 0 or to ω_2 if g(x) < 0. What this boils down to is whether w^t x, the product of the transposed weight vector with the point, is greater or less than the threshold -ω_0, the negative of the bias. The set where g(x) = 0 is the decision surface H, which divides the space into two regions, R_1 and R_2. A point lying exactly on the decision surface H (where w^t x = -ω_0) is treated as ambiguous and is not assigned to either class. If two points x_1 and x_2 both lie on the decision surface, they produce equal outputs: w^t x_1 + ω_0 = w^t x_2 + ω_0, or w^t (x_1 - x_2) = 0, so w is normal to any vector lying in H.

g(x) also acts as a measure of the distance from the decision surface. Any x can be expressed as

x = x_p + r (w / ||w||),

where x_p is the projection of x onto the hyperplane H and r is the signed distance (positive for membership in R_1, negative for membership in R_2). Because x_p lies on H, g(x_p) = 0, and as a result g(x) = r ||w||, in other words

r = g(x) / ||w||.

In particular, the (signed) distance from the origin to H is ω_0 / ||w||. So w sets the orientation of H, and ω_0 sets the distance of H from the origin.

Multicategory Case: One idea is to train, for each class, a classifier H_i that separates the space into ω_i and not-ω_i. For example, with salmon, trout, and sea bass we get regions for trout, not trout, salmon, not salmon, sea bass, and not sea bass, along with an ambiguous region where no single classifier claims the point. Another approach is to separate the classes pairwise: where trout borders salmon, we split that region into trout and salmon, but we still have the ambiguous region where all of the boundaries meet. This approach uses c(c-1)/2 classifiers, one for each pair of classes, with H_ij the linear discriminant between ω_i and ω_j.
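Before moving on to the linear machine, here is a minimal NumPy sketch of the two-category rule and the distance formula r = g(x)/||w|| just described; the weight vector, bias, and sample point are illustrative values, not anything taken from the notes.

    import numpy as np

    # Two-category linear discriminant: g(x) = w^t x + w0
    w  = np.array([2.0, -1.0])   # weight vector (illustrative values)
    w0 = 0.5                     # bias
    x  = np.array([1.0, 3.0])    # a sample point

    g = w @ x + w0               # discriminant value

    # Decision rule: omega_1 if g(x) > 0, omega_2 if g(x) < 0,
    # ambiguous if g(x) == 0 (the point lies on the decision surface H).
    label = "omega_1" if g > 0 else ("omega_2" if g < 0 else "ambiguous")

    # Signed distance from x to H: r = g(x) / ||w||
    r = g / np.linalg.norm(w)

    # Signed distance from the origin to H: w0 / ||w||
    r_origin = w0 / np.linalg.norm(w)

    print(label, r, r_origin)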

However, both approaches leave ambiguous regions, so instead we use the linear machine: one discriminant function per class, each independent of the others, and each with its own region. This differs from the first approach because every point in the space is now given membership in some class; there are no more regions of the form not-ω_i, which removes the large ambiguous region, although ambiguity remains for points lying exactly on a decision boundary. It differs from the second approach because instead of c(c-1)/2 discriminants we now have c discriminants, each of which can be more complex. A point is evaluated by each classifier in turn:

g_i(x) = w_i^t x + ω_i0,  i = 1, ..., c;  assign x to ω_i if g_i(x) > g_j(x) for all j ≠ i.

The discriminant H_ij between two regions appears where g_i(x) = g_j(x), so 0 = g_i(x) - g_j(x); using g_i(x) = w_i^t x + ω_i0, we get

0 = (w_i - w_j)^t x + (ω_i0 - ω_j0).

Thus H_ij is normal to w_i - w_j, and the signed distance from x to H_ij is

(g_i(x) - g_j(x)) / ||w_i - w_j||.

Because of this normalization, it is the difference between weight vectors, and not the magnitude of any one vector, that determines the location of the discriminant. The decision regions of a linear machine are convex, as was seen in the figures, and this imposes a definite restriction: each class density should have a single region of maximum density (unimodal) rather than multiple regions of high density (multimodal).

Linear Discriminant Function: We can write the general linear discriminant function as

g(x) = w_0 + Σ_{i=1}^{d} w_i x_i,

where [w_i] is the weight vector, d is the dimensionality of x and w, the sum is what we are used to seeing as w^t x, and w_0 is the bias term. The surface defined by g(x) = 0 is a hyperplane.

Quadratic Discriminant Function: We can add squared terms to get the quadratic discriminant function

g(x) = w_0 + Σ_{i=1}^{d} w_i x_i + Σ_{i=1}^{d} Σ_{j=1}^{d} w_ij x_i x_j.

Note that the third term introduces a matrix W = [w_ij]. Because x_i x_j = x_j x_i, we let w_ij = w_ji, so the matrix is symmetric. The surface defined by g(x) = 0 is a hyperquadric.

Separating Surface: More about the separating surface. Define the scaled matrix

W̄ = W / (w^t W^{-1} w - 4 w_0).

If W̄ = nI for some n > 0, the separating surface is a hypersphere. If W̄ is positive definite, the separating surface is a hyperellipsoid. If W̄ has both positive and negative eigenvalues, the separating surface is a hyperhyperboloid.
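To make the linear-machine rule and the surface classification concrete, here is a small NumPy sketch; the class weight vectors, biases, and quadratic coefficients are illustrative values, and the scaled-matrix test assumes W is invertible, as in the formula above.

    import numpy as np

    # --- Linear machine: c discriminants, assign x to the class with the largest g_i(x) ---
    W_rows = np.array([[ 1.0,  0.0],     # w_1
                       [ 0.0,  1.0],     # w_2
                       [-1.0, -1.0]])    # w_3
    biases = np.array([0.0, 0.2, -0.1])  # omega_{i0}

    x = np.array([0.5, 2.0])
    g = W_rows @ x + biases              # g_i(x) = w_i^t x + omega_{i0}
    assigned_class = int(np.argmax(g))   # ties (points on a boundary H_ij) stay ambiguous

    # Signed distance from x to the boundary H_ij between classes i and j.
    i, j = 0, 1
    dist_ij = (g[i] - g[j]) / np.linalg.norm(W_rows[i] - W_rows[j])

    # --- Type of separating surface for a quadratic discriminant ---
    # g(x) = w0 + w^t x + x^t W x, with W symmetric and (here) invertible.
    W  = np.array([[2.0, 0.0],
                   [0.0, 3.0]])
    w  = np.array([1.0, -1.0])
    w0 = -2.0

    scale = w @ np.linalg.solve(W, w) - 4.0 * w0   # w^t W^{-1} w - 4 w0
    W_bar = W / scale                              # scaled matrix
    eig = np.linalg.eigvalsh(W_bar)

    if np.allclose(eig, eig[0]) and eig[0] > 0:
        surface = "hypersphere"
    elif np.all(eig > 0):
        surface = "hyperellipsoid"
    elif np.any(eig > 0) and np.any(eig < 0):
        surface = "hyperhyperboloid"
    else:
        surface = "other (degenerate)"

    print(assigned_class, dist_ij, surface)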

Polynomial Discriminant Function: We can continue to add terms to the discriminant function to get a polynomial of degree n. For example, the third-order term would look like

Σ_{i=1}^{d} Σ_{j=1}^{d} Σ_{k=1}^{d} w_ijk x_i x_j x_k.

This gets unwieldy fairly quickly, so let us define a vector y consisting of d̂ functions of x, and a general weight vector a. Then we have

g(x) = Σ_{i=1}^{d̂} a_i y_i(x),  i.e.  g(x) = a^t y.

Transformation: What happened? We performed a transformation from the d-dimensional x-space to the d̂-dimensional y-space. By picking the functions y_i(x) intelligently, we can transform any feature space with a painfully complex separating surface into a y-space in which the separating surface is a hyperplane passing through the origin.

Problems: Take the quadratic example. We have d(d+1)/2 terms in W, d terms in w, and 1 bias term w_0. Adding these together gives d̂ = (d+1)(d+2)/2 terms. If we generalize to the k-th power, we have O(d^k) terms. Worse, all of the components of a need to be learned from training samples. This is simply too much to deal with.

Linear Discriminant Function: That is not to say this approach is useless. Recall the original linear discriminant function

g(x) = w_0 + Σ_{i=1}^{d} w_i x_i.

Let y = [1, x]^t and a = [w_0, w]^t. We have defined a very simple mapping from d-dimensional x-space to (d+1)-dimensional y-space, but now our decision hyperplane Ĥ, which could be anywhere in the original feature space, passes through the origin. And, in line with Section 5.2, the distance from y to Ĥ is a^t y / ||a|| = g(x) / ||a||. So we no longer need to find both w_0 and w, but rather a single weight vector a.

Two-Category Linearly Separable Case: Suppose we have n samples y_1, y_2, ..., y_n, where each sample is labeled either ω_1 or ω_2. Our goal is to use these samples to find the weights of a linear discriminant function. Let the weights be represented by the vector a and let the discriminant function be g(x) = a^t y. Ideally, we want a single weight vector that classifies all of the samples correctly; if we can find one such vector, the samples are linearly separable. A sample y_i is classified correctly if a^t y_i > 0 and y_i is labeled ω_1, or if a^t y_i < 0 and y_i is labeled ω_2.
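The augmented-vector mapping above is easy to state in code. The following minimal sketch (NumPy assumed, all values illustrative) builds y = [1, x] and a = [w_0, w] and checks that a^t y reproduces g(x) = w_0 + w^t x, along with the distance formula in y-space.

    import numpy as np

    # Original d-dimensional quantities (illustrative values).
    w  = np.array([2.0, -1.0, 0.5])
    w0 = -0.3
    x  = np.array([1.0, 0.0, 4.0])

    # Augmented (d+1)-dimensional vectors: y = [1, x], a = [w0, w].
    y = np.concatenate(([1.0], x))
    a = np.concatenate(([w0], w))

    g_x = w0 + w @ x          # g(x) in x-space
    g_y = a @ y               # a^t y in y-space
    assert np.isclose(g_x, g_y)

    # Distance from y to the hyperplane a^t y = 0 through the origin of y-space.
    dist = g_y / np.linalg.norm(a)
    print(g_x, dist)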

To normalize something is to make everything normal, that is, uniform in some respect. Here we normalize the samples by taking the negatives of the ones labeled ω_2. This turns the condition into a^t y_i > 0 for all i. Why do this? Having done it, we can ignore the labels and focus on finding a; none of the essential information has been lost by the transformation. The space prior to the transformation is called the feature space, and the space we work in afterward is called the weight space.

Separating Vector, a.k.a. Solution Vector: If there exists a vector a such that a^t y_i > 0 for all the normalized y_i, then a is a separating (solution) vector. The vector a can be viewed as a point in weight space, and every y_i places a constraint on where that point may lie: for each i, a^t y_i = 0 defines a hyperplane through the origin of the weight space, and by the definition of normal, y_i is normal to its corresponding hyperplane. If a solution vector exists, it lies on the positive side of every such hyperplane; thus, if a solution exists, it must lie in the intersection of the n half-spaces. The region in which it can lie is called the solution region.

Simplifying finding a: Note that the solution vector a is not unique. Here are two ways to constrain the solution region when finding a:

1. Find a unit-length weight vector that maximizes the minimum distance from the samples to the separating plane.

2. Given some positive real number b, find the minimum-length weight vector satisfying a^t y_i > b for all i. The number b is called the margin. The new solution region lies completely inside the old one, insulated from the old boundaries by the distance b / ||y_i||.

Our motivation for these two methods is to find a solution vector a that lies more toward the middle of the solution region. The main concern is to prevent an iterative procedure for finding a from stopping at a solution on the boundary; a solution vector closer to the middle is more likely to classify new data correctly.

Methods for finding a: We now have a set of linear inequalities a^t y_i > 0, and we want a solution vector a that satisfies them. We define a criterion function J(a) that is minimized when a is a solution vector; this criterion function is the means by which we find the solution vector. The mathematical procedure we use to minimize J(a) and find a is called gradient descent; a sketch of the whole recipe follows below.
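The notes leave J(a) generic at this point, so the sketch below uses the classic perceptron criterion J_p(a) = Σ over misclassified samples of (-a^t y) as one concrete, hedged choice, together with the sample-normalization step (negating the ω_2 samples). The data and the fixed learning rate are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative augmented samples y = [1, x] for two classes in 2-D.
    X1 = rng.normal(loc=+2.0, scale=0.7, size=(20, 2))   # class omega_1
    X2 = rng.normal(loc=-2.0, scale=0.7, size=(20, 2))   # class omega_2
    Y1 = np.column_stack([np.ones(len(X1)), X1])
    Y2 = np.column_stack([np.ones(len(X2)), X2])

    # "Normalize" the samples: negate the omega_2 samples, so that a solution
    # vector a satisfies a^t y_i > 0 for every i, regardless of the label.
    Y = np.vstack([Y1, -Y2])

    # Perceptron criterion (one common choice of J):
    #   J_p(a) = sum over misclassified samples of (-a^t y)
    #   grad J_p(a) = sum over misclassified samples of (-y)
    def grad_Jp(a, Y):
        mis = Y[Y @ a <= 0]          # samples violating a^t y > 0
        return -mis.sum(axis=0)

    # Basic gradient descent with a fixed learning rate eta (illustrative value).
    a = np.zeros(Y.shape[1])
    eta = 0.1
    for k in range(100):
        if np.all(Y @ a > 0):        # every inequality satisfied: a is a solution vector
            break
        a = a - eta * grad_Jp(a, Y)

    print(a, np.all(Y @ a > 0))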

How it works: Take an arbitrary initial weight vector a(1) and compute its gradient vector ∇J(a(1)). Then move some small distance away from a(1) in the direction of greatest descent, which by definition is the negative of the gradient; the result is a(2). In general, a(k+1) is found from a(k) by

a(k+1) = a(k) - η(k) ∇J(a(k)),

where η(k) is a positive real number called the scaling factor or learning rate. It sets the step size, that is, how far down the direction of steepest descent we move from a(k) to a(k+1).

Setting the Learning Rate η: The biggest problem with this method for finding the solution vector is that a poorly selected learning rate can make progress too slow, or can overshoot and even diverge. To address this concern, we derive a method for setting the learning rate. Start with the assumption that the criterion function J(a) can be approximated well by the second-order Taylor series expansion around a(k):

J(a) ≈ J(a(k)) + ∇J^t (a - a(k)) + (1/2) (a - a(k))^t H (a - a(k)),

where H is the Hessian matrix of second partial derivatives evaluated at a(k). The next step in deriving an efficient way to compute η is to substitute a(k+1) = a(k) - η(k) ∇J into the expansion, which gives

J(a(k+1)) ≈ J(a(k)) - η(k) ||∇J||^2 + (1/2) η^2(k) ∇J^t H ∇J.

Optimal Learning Rate: It follows that J(a(k+1)) is minimized when

η(k) = ||∇J||^2 / (∇J^t H ∇J).

This is the formula for the optimized learning rate η.
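As a small check of the optimal-rate formula, the sketch below runs gradient descent with η(k) = ||∇J||^2 / (∇J^t H ∇J) on a quadratic criterion J(a) = ½ a^t Q a - b^t a, for which the gradient is Q a - b and the Hessian is Q. The matrix Q and vector b are illustrative; for a truly quadratic J the Taylor expansion above is exact, so this η is optimal at every step.

    import numpy as np

    # Illustrative quadratic criterion J(a) = 0.5 a^t Q a - b^t a.
    Q = np.array([[4.0, 1.0],
                  [1.0, 3.0]])        # Hessian H (constant for a quadratic)
    b = np.array([1.0, 2.0])

    def grad_J(a):
        return Q @ a - b              # gradient of J at a

    a = np.array([5.0, -5.0])         # arbitrary starting point a(1)
    for k in range(25):
        g = grad_J(a)
        if np.linalg.norm(g) < 1e-10:
            break
        eta = (g @ g) / (g @ Q @ g)   # optimal learning rate ||grad||^2 / (grad^t H grad)
        a = a - eta * g               # a(k+1) = a(k) - eta(k) grad J(a(k))

    print(a, np.linalg.solve(Q, b))   # should agree: the minimizer satisfies Q a = b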

Newton's Algorithm: Another method that can be used besides gradient descent is Newton's algorithm. Here we do not obtain a(k+1) by taking a small step down the negative gradient with a learning rate, but instead by multiplying the gradient by the inverse Hessian, so the Hessian must be nonsingular. It should be noted, though, that the inversion takes O(d^3) time, a cost which can add up quickly and make gradient descent more practical. Newton's algorithm works well on quadratic error functions because a(k+1) directly minimizes the second-order expansion. It is obtained by inserting the following in place of line 3 of Algorithm 1 (Insert Algorithm 2 here):

a(k+1) = a(k) - H^{-1} ∇J
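For comparison with the gradient-descent sketch above, here is a minimal Newton step on the same kind of illustrative quadratic criterion (Q and b are again made-up values). On a quadratic, a single Newton update lands exactly on the minimizer, which is what "works well on quadratic error functions" means in practice; using np.linalg.solve avoids forming H^{-1} explicitly, though the cost is still cubic in the dimension.

    import numpy as np

    # Same illustrative quadratic criterion J(a) = 0.5 a^t Q a - b^t a as before.
    Q = np.array([[4.0, 1.0],
                  [1.0, 3.0]])        # Hessian H, assumed nonsingular
    b = np.array([1.0, 2.0])

    a = np.array([5.0, -5.0])         # arbitrary starting point
    grad = Q @ a - b                  # gradient of J at a

    # Newton update: a(k+1) = a(k) - H^{-1} grad J  (solve instead of explicit inverse).
    a_next = a - np.linalg.solve(Q, grad)

    print(a_next, np.linalg.solve(Q, b))   # equal: one Newton step minimizes a quadratic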