STATS306B: Discriminant Analysis. Jonathan Taylor, Department of Statistics, Stanford University. June 3, 2010 (Spring 2010).


Classification. Given $K$ classes in $\mathbb{R}^p$, represented as densities $f_i(x)$, $1 \le i \le K$, classify $x \in \mathbb{R}^p$. In other words, partition $\mathbb{R}^p$ (or another sample space) into subsets $\Pi_i$, $1 \le i \le K$, based on the densities $f_i(x)$. Maximum likelihood rule: $x \in \Pi_i \iff i = \operatorname{argmax}_j f_j(x)$.

Example: multinomial. Suppose the sample space is all $p$-tuples of non-negative integers that sum to $n$. Two classes: $f_1 = \mathrm{Multinom}(n, \alpha)$, $f_2 = \mathrm{Multinom}(n, \beta)$. The ML rule boils down to $x \in \Pi_1 \iff \sum_{i=1}^p x_i \log(\alpha_i/\beta_i) > 0$. The function $h_{12}(x) = \sum_{i=1}^p x_i \log(\alpha_i/\beta_i)$ is called a discriminant function between classes 1 & 2.
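As a quick illustration of the ML rule, here is a minimal numpy sketch that evaluates the multinomial discriminant function; the probabilities $\alpha$, $\beta$ and the count vector are made-up toy values, and the helper name is ours.

```python
import numpy as np

def h12(x, alpha, beta):
    """Multinomial discriminant h_12(x) = sum_i x_i * log(alpha_i / beta_i)."""
    return np.sum(x * np.log(alpha / beta))

# toy example (hypothetical class probabilities and counts)
alpha = np.array([0.5, 0.3, 0.2])
beta = np.array([0.2, 0.3, 0.5])
x = np.array([6, 3, 1])                      # p-tuple of counts summing to n = 10
label = 1 if h12(x, alpha, beta) > 0 else 2  # ML rule: class 1 iff h_12(x) > 0
print(label)
```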

Discriminant functions. The ML rule can be summarized as $x \in \Pi_i \iff h_{ij}(x) > 0$ for all $j$, where $h_{ij}(x) = \log\frac{f_i(x)}{f_j(x)}$.

Bayesian rule. If prior class probabilities $(\pi_1, \ldots, \pi_K)$ are available, a more sensible rule is $x \in \Pi_i \iff i = \operatorname{argmax}_j \pi_j f_j(x)$. Modified discriminant functions: $\tilde h_{ij}(x) = h_{ij}(x) + \log\frac{\pi_i}{\pi_j}$.

Example: Gaussian in $\mathbb{R}$. Let $f_1 = N(\mu_1, \sigma_1^2)$, $f_2 = N(\mu_2, \sigma_2^2)$. Discriminant function:
$$h_{12}(x) = -\frac{x^2}{2}\left(\frac{1}{\sigma_1^2} - \frac{1}{\sigma_2^2}\right) + x\left(\frac{\mu_1}{\sigma_1^2} - \frac{\mu_2}{\sigma_2^2}\right) + \frac{1}{2}\left(\frac{\mu_2^2}{\sigma_2^2} - \frac{\mu_1^2}{\sigma_1^2}\right) + \log\frac{\sigma_2}{\sigma_1}$$
Note: $h_{12}$ is quadratic in $x$ unless $\sigma_1 = \sigma_2$. LDA (Linear Discriminant Analysis): $\sigma_1 = \sigma_2$. QDA (Quadratic Discriminant Analysis): $\sigma_1 \ne \sigma_2$.
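The formula above is just the log-likelihood ratio of the two Gaussians written out. A minimal sketch, with hypothetical parameter values, makes the LDA/QDA distinction concrete: the quadratic term drops out exactly when the two variances agree.

```python
import numpy as np

def h12_gaussian(x, mu1, s1, mu2, s2):
    """Log-likelihood ratio log f1(x) - log f2(x) for two univariate Gaussians."""
    return (-0.5 * x**2 * (1 / s1**2 - 1 / s2**2)
            + x * (mu1 / s1**2 - mu2 / s2**2)
            + 0.5 * (mu2**2 / s2**2 - mu1**2 / s1**2)
            + np.log(s2 / s1))

# equal variances: the x^2 term vanishes, so the discriminant is linear in x (LDA)
print(h12_gaussian(0.7, mu1=0.0, s1=1.0, mu2=2.0, s2=1.0))
# unequal variances: h_12 is quadratic in x (QDA)
print(h12_gaussian(0.7, mu1=0.0, s1=1.0, mu2=2.0, s2=2.0))
```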

Example: Gaussian in $\mathbb{R}^p$. In general, the ML rule classifies $x$ by $x \in \Pi_i \iff i = \operatorname{argmin}_j \big[ d_{\Sigma_j}(x, \mu_j)^2 + \log\det(\Sigma_j) \big]$, where $d_\Sigma(x,\mu)^2 = (x-\mu)^\top\Sigma^{-1}(x-\mu)$ is the squared Mahalanobis distance; i.e. it minimizes Mahalanobis distance after adjusting for $\Sigma_j$. If $\Sigma_i = \Sigma$ for all $i$, the ML (LDA) rule classifies by minimizing the Mahalanobis distance alone. The Bayesian rule (with $\Sigma_i = \Sigma$) classifies by $x \in \Pi_i \iff i = \operatorname{argmin}_j \big[ d_{\Sigma}(x, \mu_j)^2 - 2\log\pi_j \big]$.

Sample ML and Bayesian rules. For each class, estimate $(\hat\mu_i, \hat\Sigma_i, \hat\pi_i)$ with $\hat\pi_i = n_i/n$. QDA: classify according to $x \in \Pi_i \iff i = \operatorname{argmin}_j \big[ d_{\hat\Sigma_j}(x, \hat\mu_j)^2 + \log\det(\hat\Sigma_j) - 2\log\hat\pi_j \big]$.

Sample ML and Bayesian rules. LDA: estimate the pooled covariance matrix $\hat\Sigma = \frac{1}{n-K}\sum_{i=1}^K (n_i - 1)\hat\Sigma_i$ and classify according to $x \in \Pi_i \iff i = \operatorname{argmin}_j \big[ d_{\hat\Sigma}(x, \hat\mu_j)^2 - 2\log\hat\pi_j \big]$.
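A minimal numpy sketch of the plug-in LDA rule just described (class means, priors $n_i/n$, pooled covariance, then nearest class by squared Mahalanobis distance minus $2\log\hat\pi_j$); it assumes at least two observations per class, and the helper names are ours.

```python
import numpy as np

def fit_lda(X, y):
    """Plug-in estimates: class means, priors n_i/n, pooled covariance."""
    classes = np.unique(y)
    n, p = X.shape
    means, priors = [], []
    pooled = np.zeros((p, p))
    for c in classes:
        Xc = X[y == c]
        means.append(Xc.mean(axis=0))
        priors.append(len(Xc) / n)
        pooled += (len(Xc) - 1) * np.cov(Xc, rowvar=False)
    pooled /= n - len(classes)
    return classes, np.array(means), np.array(priors), pooled

def predict_lda(x, classes, means, priors, pooled):
    """Classify by minimizing d_Sigma(x, mu_j)^2 - 2 log pi_j."""
    Sinv = np.linalg.inv(pooled)
    scores = [(x - m) @ Sinv @ (x - m) - 2 * np.log(pi)
              for m, pi in zip(means, priors)]
    return classes[int(np.argmin(scores))]
```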

Gaussian in $\mathbb{R}^p$: $\Sigma_i = \Sigma$. Suppose that $p > K$, and let $L = \mu_1 + \operatorname{span}\{\mu_i - \mu_1,\ 2 \le i \le K\}$. It is clear that all of the action happens along the affine subspace $L \subset \mathbb{R}^p$ of dimension at most $K - 1$. This suggests we should reduce dimension...

Fisher's linear discriminant. Assumption: $\Sigma_i = \Sigma$. Given a data matrix $X_{n\times p}$ and labels $L_l$, $1 \le l \le n$, consider a linear combination $Y(v)_{n\times 1} = Xv$. The total sum of squares of $Y(v)$ can be decomposed as
$$\sum_{l=1}^n \big(Y(v)_l - \bar Y(v)\big)^2 = \sum_{i=1}^K n_i \big(\bar Y(v)_i - \bar Y(v)\big)^2 + \sum_{i=1}^K \sum_{j=1}^{n_i} \big(Y(v)_{ij} - \bar Y(v)_i\big)^2 = v^\top \hat\Sigma_B v + v^\top \hat\Sigma_W v.$$

Fisher's linear discriminant. Fisher's suggestion: choose $\hat v = \operatorname{argmax}_{v\,:\,v^\top \hat\Sigma_W v = 1} v^\top \hat\Sigma_B v$, i.e. maximize the between-group variance subject to a within-group variance of 1. This leads to the generalized eigenvalue problem $\hat\Sigma_B v = \lambda \hat\Sigma_W v$. We can construct up to $K - 1$ different directions (subject to $v_i^\top \hat\Sigma_W v_j = \delta_{ij}$).
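A short sketch of this generalized eigenproblem using scipy.linalg.eigh, which handles the pair $(\hat\Sigma_B, \hat\Sigma_W)$ directly and returns eigenvectors normalized so that $V^\top \hat\Sigma_W V = I$; it assumes $\hat\Sigma_W$ is nonsingular (otherwise add a small ridge), and the helper name is ours.

```python
import numpy as np
from scipy.linalg import eigh

def fisher_directions(X, y):
    """Fisher directions: generalized eigenvectors of (Sigma_B, Sigma_W)."""
    classes = np.unique(y)
    n, p = X.shape
    xbar = X.mean(axis=0)
    Sb = np.zeros((p, p))
    Sw = np.zeros((p, p))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sb += len(Xc) * np.outer(mc - xbar, mc - xbar)   # between-group scatter
        Sw += (Xc - mc).T @ (Xc - mc)                    # within-group scatter
    # solve Sigma_B v = lambda Sigma_W v; eigenvectors come back Sigma_W-orthonormal
    evals, V = eigh(Sb, Sw)
    order = np.argsort(evals)[::-1]
    return V[:, order[: len(classes) - 1]]               # top K-1 directions
```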

Fisher's linear discriminant. Define the Fisher discriminant scores $V_j = X\hat v_j$, $1 \le j \le K-1$, to form a new data matrix $V_{n\times(K-1)}$. The pooled covariance matrix of the $V_j$'s will be $I$, so LDA is just classifying to the nearest centroid $\bar V_i$, the mean of $V$ in class $i$.

Fisher's linear discriminant & CCA. Consider the indicators $Y_{li} = 1\{L_l = i\}$, $1 \le l \le n$, $1 \le i \le K-1$. Putting the data matrices $(Y, X)$ through CCA yields $K-1$ pairs $(\hat\alpha_i, \hat\beta_i)$, $1 \le i \le K-1$, of canonical directions. It turns out that $\hat\beta_i = \hat v_i$ (up to a scalar multiple)...

Reducing the rank. Suppose some of the $\mu_i$'s are collinear, so that $\dim(L) < K-1$. Then some of the Fisher scores will have little information. We can discard some of the scores and then classify according to the nearest centroid in the reduced space.

QDA revisited. Fisher's linear discriminant functions are dimension reduction tools. In the olive data, the groups have unequal variances, which suggests we could use QDA on the Fisher scores. Note: this is not the same as QDA on the whole vector unless the noise orthogonal to $L$ has the same covariance...

QDA revisited. LDA produces boundaries that are linear in $X$. If we transform $X \in \mathbb{R}^p$ to $f(X) \in \mathbb{R}^q$, LDA on $f(X)$ will produce boundaries that are linear in the components of $f(X)$. Suppose $p = 2$ and take $f(x) = (x_1, x_2, x_1^2, x_2^2, x_1 x_2)$. This will produce discriminant functions $h_{ij}(x) = a_{ij,1} x_1 + a_{ij,2} x_2 + a_{ij,3} x_1^2 + a_{ij,4} x_2^2 + a_{ij,5} x_1 x_2 + c_{ij}$: a cheap way to get quadratic boundaries.
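A minimal sketch of this basis-expansion trick using scikit-learn (assuming it is available); the data below are synthetic stand-ins, and the point is only that LDA fit on the expanded features yields boundaries that are quadratic in the original coordinates.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# synthetic 2-d data with three classes (hypothetical)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = rng.integers(0, 3, size=300)

# expand (x1, x2) -> (x1, x2, x1^2, x1*x2, x2^2), then run ordinary LDA
expand = PolynomialFeatures(degree=2, include_bias=False)
Xq = expand.fit_transform(X)
lda = LinearDiscriminantAnalysis().fit(Xq, y)

# boundaries are linear in Xq, hence quadratic in the original (x1, x2)
print(lda.predict(expand.transform([[0.3, -1.2]])))
```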

More general expansions. Why limit ourselves to quadratic? We could take a large basis $f(x) = (f_1(x), \ldots, f_m(x))$ and perform LDA on $f(X)$ with labels $L$. If we take all $K-1$ Fisher scores, the number of coefficients we need to estimate is $(K-1)\,m$, and this grows quickly.

Penalized discriminant analysis. Recall that Fisher's scores were constructed as $\max_v v^\top \hat\Sigma_B v$ subject to $v^\top \hat\Sigma_W v = 1$. To regularize, we can insist instead that $v^\top(\hat\Sigma_W + \lambda\Omega)v = 1$. If $\Omega$ penalizes rough functions, this will produce smoother decision boundaries as $\lambda$ grows...

Penalized discriminant analysis. Generalized eigenproblem: $\hat\Sigma_B v = \gamma\,(\hat\Sigma_W + \lambda\Omega)v$, with $\hat\Sigma_B$, $\hat\Sigma_W$ the estimated covariance matrices of the derived variables $f(X)_{n\times m}$. Using the scores from this eigenproblem and classifying by nearest centroid corresponds to the rule $x \in \Pi_i \iff i = \operatorname{argmin}_j \big[ d_{\hat\Sigma_W + \lambda\Omega}(f(x), \hat\mu_{j,f})^2 - 2\log\hat\pi_j \big]$, with $\hat\mu_{j,f}$ the sample mean of $f(X)$ in class $j$.
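The penalized eigenproblem is only a small change to the unpenalized sketch shown earlier: replace $\hat\Sigma_W$ by $\hat\Sigma_W + \lambda\Omega$. A minimal sketch, assuming the scatter matrices and penalty matrix have already been formed and that $\hat\Sigma_W + \lambda\Omega$ is positive definite:

```python
import numpy as np
from scipy.linalg import eigh

def penalized_directions(Sb, Sw, Omega, lam, k):
    """Scores from Sigma_B v = gamma (Sigma_W + lam * Omega) v."""
    evals, V = eigh(Sb, Sw + lam * Omega)
    order = np.argsort(evals)[::-1]
    # eigenvectors are normalized so that V' (Sigma_W + lam * Omega) V = I
    return V[:, order[:k]]
```

As $\lambda$ grows, directions that $\Omega$ deems rough are increasingly shrunk away, which is what smooths the decision boundaries.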

Flexible discriminant analysis The previous penalized approach suggests the following strategy: 1. Find good scores... 2. Use nearest centroid classification on the scores... How do we find good scores?

Flexible discriminant analysis. Connection with CCA: let $Y_{n\times K}$ be the matrix of indicators for the classes based on the labels $L_{n\times 1}$. Fisher's directions are (parallel to) the canonical directions for $X$:
$$(\hat\alpha, \hat\beta) = \operatorname{argmax}_{\alpha,\beta} \widehat{\mathrm{Cor}}(\alpha^\top Y, \beta^\top X) = \operatorname{argmax}_{\alpha,\beta} \frac{1}{n-1}(Y\alpha)^\top(X\beta)$$
subject to $\widehat{\mathrm{Var}}(\alpha^\top Y) = \alpha^\top\hat\Sigma_Y\alpha = \widehat{\mathrm{Var}}(\beta^\top X) = \beta^\top\hat\Sigma_X\beta = 1$ and $\hat E(\alpha^\top Y) = \hat E(\beta^\top X) = 0$.

Flexible discriminant analysis. Under these constraints, $\frac{1}{n-1}(Y\alpha)^\top(X\beta) = 1 - \frac{1}{2(n-1)}\|Y\alpha - X\beta\|^2$. So, fixing $\alpha$, maximizing $\widehat{\mathrm{Cor}}(\alpha^\top Y, \beta^\top X)$ over $\beta$ amounts to regressing $Y\alpha$ onto $X$.

Flexible discriminant analysis. The (unpenalized) problem is recast as $\min_{\theta,\beta} \sum_{i=1}^n \big(\theta(l_i) - X_i\beta\big)^2$, with $l_i$ the $i$-th label and $X_i$ the $i$-th row of $X$, subject to the constraints $\sum_{i=1}^n \theta(l_i) = 0$, $\sum_{i=1}^n \theta(l_i)^2 = 1$ (a re-expression of $\hat E(Y\alpha) = 0$, $\widehat{\mathrm{Var}}(\alpha^\top Y) = 1$). As in CCA, we obtain successive pairs $(\hat\theta_l, \hat\beta_l)$ solving this problem... The inner regression can be replaced with a more flexible model...

Flexible discriminant analysis (FDA): algorithm (Ch. 12, ESL).
1. Let $\hat Y = \hat\eta^*(X)$ be a linear regression estimate of $E(Y)$, i.e. $\hat Y$ is $n\times K$ with $i$-th row $\hat\eta^*(X_i)$.
2. Let $C_{K\times K} = Y^\top \hat Y$.
3. Let $\Theta$ be the eigenvectors of $C$, normalized so that $\Theta^\top D_{\hat\pi} \Theta = I$, where $D_{\hat\pi} = \mathrm{diag}(\hat\pi_1, \ldots, \hat\pi_K)$. [Maximization over $\alpha$.]
4. Define $\hat\eta(x) = \Theta^\top \hat\eta^*(x)$. [Update the output of the regression to give the optimal scores from above.]
5. Compute $\hat\eta(X)_{n\times K}$ and the centroids $\bar\eta_1, \ldots, \bar\eta_K$.
6. Classify a new observation based on $\hat\eta(x)$ to the nearest centroid $\bar\eta_i$.
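A rough numpy sketch of the recipe above, using plain least squares for $\hat\eta^*$ and a generalized eigensolver for the $D_{\hat\pi}$-normalized eigenvectors. It is a simplification under stated assumptions, not the full ESL procedure: the trivial constant score (eigenvalue $\approx 1$) is dropped and no per-score weighting is applied; all helper names are ours.

```python
import numpy as np
from scipy.linalg import eigh

def fda_linear(X, labels):
    """Optimal-scoring LDA: regress indicators on X, extract scores, classify by centroid."""
    n, p = X.shape
    classes, y = np.unique(labels, return_inverse=True)
    K = len(classes)
    Y = np.zeros((n, K))
    Y[np.arange(n), y] = 1.0                       # indicator responses
    pi_hat = Y.mean(axis=0)

    X1 = np.column_stack([np.ones(n), X])          # step 1: regress Y on X (with intercept)
    B, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    Yhat = X1 @ B                                  # eta*(X), n x K

    C = (Y.T @ Yhat) / n                           # step 2 (scaled by 1/n)
    C = (C + C.T) / 2                              # symmetrize against round-off
    evals, Theta = eigh(C, np.diag(pi_hat))        # step 3: Theta' D_pi Theta = I
    Theta = Theta[:, ::-1][:, 1:]                  # drop the trivial constant score

    Eta = Yhat @ Theta                             # steps 4-5: scores and centroids
    centroids = np.vstack([Eta[y == j].mean(axis=0) for j in range(K)])

    def predict(x_new):                            # step 6: nearest centroid in score space
        eta_new = np.concatenate(([1.0], x_new)) @ B @ Theta
        return classes[np.argmin(((centroids - eta_new) ** 2).sum(axis=1))]
    return predict
```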

Flexible discriminant analysis (FDA).
1. FDA tries to minimize $L_\lambda(\Theta, \eta) = \frac{1}{2}\mathrm{Tr}\big((Y\Theta - \eta(X))^\top(Y\Theta - \eta(X))\big)$, where $\eta = \eta(X, Y, \Theta, \lambda)$ is a multivariate regression method.
2. More precisely, we could minimize $L_\lambda(\Theta, \beta) = \frac{1}{2}\mathrm{Tr}\big((Y\Theta - X\beta)^\top(Y\Theta - X\beta)\big) + \lambda P(\beta)$ for some penalty $P(\beta)$.
3. Example: LASSO, $P(\beta) = \sum_{i=1}^p \sum_{j=1}^k |\beta_{ij}|$.

Steps of the alternating algorithm.
1. Choose some initial $\Theta_0$ such that $\Theta_0^\top (Y^\top Y)\Theta_0 = n I_{k\times k}$.
2. For $\Theta$ fixed, define $\hat\eta = \eta(X, Y\Theta, \lambda) : \mathbb{R}^p \to \mathbb{R}^k$ to be the output of the regression method when $Y\Theta$ is regressed onto $X$.
3. For $(\eta, \Theta)$ fixed, define $\hat U = \hat U(Y\Theta, \hat\eta(X)) = \operatorname{argmin}_{U : U^\top U = I} \mathrm{Tr}\big((Y\Theta U - \hat\eta(X))^\top(Y\Theta U - \hat\eta(X))\big)$.

Procrustes problem. The problem $\hat U = \hat U(Y\Theta, \hat\eta(X)) = \operatorname{argmin}_{U : U^\top U = I} \mathrm{Tr}\big((Y\Theta U - \hat\eta(X))^\top(Y\Theta U - \hat\eta(X))\big)$ is called a Procrustes problem. The matrix $\hat U$ can be obtained via an SVD of $(Y\Theta)^\top\hat\eta(X)$: if $(Y\Theta)^\top\hat\eta(X) = U_1 D U_2^\top$, then $\hat U = U_1 U_2^\top$. Note: if $(Y\Theta)^\top\hat\eta(X)$ is symmetric, then $\hat U = I$ and $D$ contains its eigenvalues. These singular values are used as weights for the different optimal scores.
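A minimal numpy check of this SVD solution to the Procrustes problem; the matrices below are random stand-ins for $Y\Theta$ and $\hat\eta(X)$, and the helper name is ours.

```python
import numpy as np

def procrustes(A, B):
    """argmin_{U: U'U = I} ||A U - B||_F^2 via the SVD of A' B."""
    U1, d, U2t = np.linalg.svd(A.T @ B)
    return U1 @ U2t, d                 # optimal orthogonal U and the singular values

# random stand-ins for Y Theta and eta(X) of matching shape
rng = np.random.default_rng(1)
A = rng.normal(size=(50, 3))
B = rng.normal(size=(50, 3))
U_hat, weights = procrustes(A, B)
print(np.linalg.norm(A @ U_hat - B))   # no orthogonal U achieves a smaller residual
```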

Alternating algorithm for FDA. Choose some initial $\Theta_0$ such that $\Theta_0^\top Y^\top Y \Theta_0 = nI$. For $i \ge 1$, until convergence is reached based on $L_\lambda(\Theta, \eta)$:
1. Find $\hat\eta_i = \eta(X, Y\Theta_i, \lambda)$.
2. Compute $(Y\Theta_i)^\top\hat\eta_i(X)$ and find its SVD: $U_1 D_i U_2^\top$.
3. Update $\Theta_{i+1} = \Theta_i U_1 U_2^\top$.
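A rough numpy sketch of this alternating scheme with ridge regression (no intercept, for brevity) as the inner regression method; the initialization whitens $\Theta_0$ against $Y^\top Y$, which assumes $Y^\top Y$ is nonsingular (e.g. an indicator matrix with every class present). The number of iterations is fixed here rather than checked against $L_\lambda$.

```python
import numpy as np

def fda_alternating(X, Y, lam=1.0, k=None, n_iter=20):
    """Alternate a ridge-regression step and a Procrustes update of the scores Theta."""
    n, p = X.shape
    K = Y.shape[1]
    k = k or K
    # initial Theta_0 with Theta_0' (Y'Y) Theta_0 = n I
    evals, evecs = np.linalg.eigh(Y.T @ Y)
    Theta = (evecs @ np.diag(np.sqrt(n / evals)))[:, :k]
    for _ in range(n_iter):
        Z = Y @ Theta                                            # scored responses, n x k
        B = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Z)  # ridge: regress Z on X
        eta = X @ B                                              # fitted values eta_i(X)
        U1, d, U2t = np.linalg.svd(Z.T @ eta)                    # Procrustes step
        Theta = Theta @ (U1 @ U2t)                               # orthogonal update
    return Theta, B, d
```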

Alternating algorithm for FDA. This will converge as long as each of the steps of finding $\hat\eta_i$ and $\Theta_{i+1}$ decreases the loss. Use $\hat\eta$ to compute the class centroids $(\bar\eta_j)_{1\le j\le k}$. Classify using nearest centroids, with coordinate weights $1/(D(1-D))$, where $D$ contains the singular values from the final Procrustes step.

Digits examples. First example: $P(\beta) = \mathrm{Tr}(\beta_{(0)}^\top L \beta_{(0)})$, where $\beta_{(0)}$ is $\beta$ without the intercept term $\beta_0$. The penalty matrix $L$ is the discrete Laplacian of the $16\times 16$ pixel lattice, defined as $\mathrm{diag}(\mathrm{rowsum}(A)) - A$, where $A_{256\times 256}$ is the adjacency matrix of the lattice. Second example: $P(\beta) = \|\beta\|_1$, the LASSO penalty.
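A small numpy sketch of this Laplacian penalty, assuming a 4-neighbour adjacency on the $16\times 16$ pixel grid (the slides do not spell out the neighbourhood structure); the coefficient vector below is a hypothetical stand-in.

```python
import numpy as np

def lattice_laplacian(nrow=16, ncol=16):
    """Discrete Laplacian diag(rowsum(A)) - A of a 4-neighbour grid."""
    n = nrow * ncol
    A = np.zeros((n, n))
    idx = lambda r, c: r * ncol + c
    for r in range(nrow):
        for c in range(ncol):
            if r + 1 < nrow:                       # vertical neighbours
                A[idx(r, c), idx(r + 1, c)] = A[idx(r + 1, c), idx(r, c)] = 1
            if c + 1 < ncol:                       # horizontal neighbours
                A[idx(r, c), idx(r, c + 1)] = A[idx(r, c + 1), idx(r, c)] = 1
    return np.diag(A.sum(axis=1)) - A

L = lattice_laplacian()                            # 256 x 256
beta_col = np.random.randn(256)                    # hypothetical coefficient image
roughness = beta_col @ L @ beta_col                # one column's contribution to Tr(b' L b)
```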

[Figure slides: digits example with the ridge/discrete-Laplacian penalty and with the LASSO penalty.]