
From Last Meeting: Studied Fisher Linear Discrimination
- Mathematics
- Point Cloud view
- Likelihood view
- Toy examples
- Extensions (e.g. Principal Discriminant Analysis)

Polynomial Embedding
Aizerman, Braverman and Rozoner (1964) Automation and Remote Control, 25, 821-837.
Motivating idea: extend the scope of linear discrimination by adding nonlinear components to the data (better use of the name nonlinear discrimination????)
E.g. in 1-d, linear separation splits the domain $\{x : x \in \mathbb{R}\}$ into only 2 parts. Show PolyEmbed/PolyEmbed1d.mpg

Polynomial Embedding (cont.)
But in the quadratic embedded domain $\{(x, x^2) : x \in \mathbb{R}\} \subset \mathbb{R}^2$, linear separation can give 3 parts. Show PolyEmbed/PolyEmbed2d.mpg
- original data space lies in a 1-d manifold
- very sparse region of $\mathbb{R}^2$
- curvature of the manifold gives better linear separation
- can have any 2 break points (2 points determine a line)

Polynomial Embedding (cont.)
Stronger effects for higher order polynomial embedding. E.g. for cubic, $\{(x, x^2, x^3) : x \in \mathbb{R}\} \subset \mathbb{R}^3$, linear separation can give 4 parts (or fewer). Show PolyEmbed/PolyEmbed3d.mpg
- original space lies in a 1-d manifold, even sparser in $\mathbb{R}^3$
- higher-d curvature gives improved linear separation
- can have any 3 break points (3 points determine a plane)?
- relatively few interesting separating planes
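To make the break-point counting concrete, here is a minimal numpy sketch (mine, not from the slides): a linear rule in the degree-$p$ embedded domain is $\mathrm{sign}(w_0 + w_1 x + \cdots + w_p x^p)$, a degree-$p$ polynomial in $x$, so it changes sign at most $p$ times and splits the line into at most $p+1$ parts.

```python
# Minimal sketch: count the pieces of the real line carved out by a linear rule
# applied in the polynomial embedded domain (x, x^2, ..., x^p).
import numpy as np

def decision_pieces(w, grid=np.linspace(-5, 5, 10001)):
    """Pieces of the grid on which sign(w[0] + w[1]*x + ... + w[p]*x^p) is constant."""
    p = len(w) - 1
    features = np.vander(grid, N=p + 1, increasing=True)   # columns 1, x, x^2, ..., x^p
    positive = features @ np.asarray(w, dtype=float) > 0   # the linear rule in embedded space
    return 1 + int(np.sum(positive[1:] != positive[:-1]))

print(decision_pieces([1.0, -2.0]))            # degree 1: at most 2 parts
print(decision_pieces([-1.0, 0.0, 1.0]))       # degree 2 (x^2 - 1): up to 3 parts
print(decision_pieces([0.0, -1.0, 0.0, 1.0]))  # degree 3 (x^3 - x): up to 4 parts
```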

Polynomial Embedding (cont.)
General view: for the original data matrix
$$\begin{pmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{d1} & \cdots & x_{dn} \end{pmatrix}$$
add rows of embedded terms, e.g. the squares:
$$\begin{pmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{d1} & \cdots & x_{dn} \\ x_{11}^2 & \cdots & x_{1n}^2 \\ \vdots & \ddots & \vdots \\ x_{d1}^2 & \cdots & x_{dn}^2 \end{pmatrix}$$
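A minimal numpy sketch of this "add rows" construction, under the slide's convention that rows are variables and columns are cases; the cross-product rows are my addition, to show a full degree-2 embedding, and can be dropped.

```python
# Sketch: quadratic polynomial embedding of a d x n data matrix (columns are cases).
import numpy as np
from itertools import combinations

def quadratic_embed(X):
    """Stack the original rows, their squares, and (assumed here) the pairwise cross products."""
    d, n = X.shape
    squares = X ** 2                                                   # rows x_j^2
    cross = np.array([X[i] * X[j] for i, j in combinations(range(d), 2)])
    cross = cross.reshape(-1, n)                                       # handles d == 1 (no cross terms)
    return np.vstack([X, squares, cross])

X = np.random.default_rng(0).normal(size=(2, 5))                       # d = 2 variables, n = 5 cases
print(quadratic_embed(X).shape)                                        # (5, 5): rows x1, x2, x1^2, x2^2, x1*x2
```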

Polynomial Embedding (cont.)
Fisher Linear Discrimination: choose Class 1 for a new point $X_0$ when
$$\left(\bar{X}^{(1)} - \bar{X}^{(2)}\right)^t \hat{\Sigma}_w^{-1} \left( X_0 - \frac{\bar{X}^{(1)} + \bar{X}^{(2)}}{2} \right) \ge 0$$
in the embedded space.
- image of the class boundaries in the original space is nonlinear
- allows much more complicated class regions
- can also do Gaussian Likelihood Ratio (or others)
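A hedged numpy sketch of this rule (my own minimal version; the small ridge term is an addition, since the pooled covariance easily becomes singular once many embedded rows are appended). Here E1 and E2 are the two classes' columns of the embedded data matrix.

```python
# Sketch: Fisher Linear Discrimination applied in an embedded feature space.
import numpy as np

def fld_fit(E1, E2):
    """E1, E2: embedded data matrices (features x cases) for the two classes.
    Returns the discriminant direction and the midpoint of the class means."""
    m1, m2 = E1.mean(axis=1), E2.mean(axis=1)
    # pooled within-class covariance, with a small ridge term against singularity
    S = np.cov(E1) * (E1.shape[1] - 1) + np.cov(E2) * (E2.shape[1] - 1)
    S /= (E1.shape[1] + E2.shape[1] - 2)
    S += 1e-8 * np.eye(S.shape[0])
    w = np.linalg.solve(S, m1 - m2)
    return w, (m1 + m2) / 2

def fld_classify(w, mid, e0):
    """Class 1 if the discriminant score is nonnegative, else Class 2."""
    return 1 if w @ (e0 - mid) >= 0 else 2
```

In the embedded setting one would apply fld_fit to quadratic_embed(X1) and quadratic_embed(X2) from the sketch above, and classify the embedded version of a new point.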

Polynomial Embedding Toy Examples
E.g. 1: Parallel Clouds. Show PolyEmbed\PEodRaw.ps
- PC1 always bad (finds embedded greatest var. only). Show PolyEmbed\PEodPCcombine.pdf
- FLD stays good. Show PolyEmbed\PEodFLDcombine.pdf
- GLR OK discrimination at the data, but overfitting problems. Show PolyEmbed\PEodGLRcombine.pdf

Polynomial Embedding Toy Examples (cont.)
E.g. 2: Two Clouds. Show PolyEmbed\PEtclRaw.ps
- FLD good, generally improves with higher degree. Show PolyEmbed\PEtclFLDcombine.pdf
- GLR mostly good, some overfitting. Show PolyEmbed\PEtclGLRcombine.pdf
- $x_1^2$, $x_2^2$, $x_1 x_2$ similar in shape to $x_1$, $x_2$???

Polynomial Embedding Toy Examples (cont.)
E.g. 3: Split X. Show PolyEmbed\PEed3Raw.ps
- FLD rapidly improves with higher degree. Show PolyEmbed\Ped3FLDcombine.pdf
- GLR always good, but never an ellipse around the blues? Show PolyEmbed\Ped3GLRcombine.pdf
- Should apply ICA first? Show HDLSS\HDLSSd3ICA.ps

Polynomial Embedding Toy Examples (cont.)
E.g. 4: Split X, parallel to Axes. Show PolyEmbed\Ped4Raw.ps
- FLD fine with more embedding. Show PolyEmbed\Ped4FLDcombine.pdf
- GLR OK for all, no overfitting. Show PolyEmbed\Ped4LGLRcombine.pdf
- never found an ellipse (maybe a hyperbola is right?)
- ICA helped FLD (better for lower degree)

Polynomial Embedding Toy Examples (cont.)
E.g. 5: Donut. Show PolyEmbed\PEdonRaw.ps
- FLD: poor for low degree, then good, no overfitting. Show PolyEmbed\PEdonFLDcombine.pdf
- GLR: best with no embedding, square shape for overfitting? Show PolyEmbed\PEdonGLRcombine.pdf
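Since the donut is the clearest case where a linear rule fails in the original space but works after embedding, here is a hedged sketch with invented toy data (not the slide's actual files), using scikit-learn's LinearDiscriminantAnalysis: after a degree-2 embedding the squared radius $x_1^2 + x_2^2$ is a linear function of the features, so a roughly circular boundary becomes available to a linear rule.

```python
# Sketch: blob-inside-annulus ("donut") data; linear FLD fails raw, works after embedding.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
n = 200
blob = rng.normal(scale=0.5, size=(n, 2))                              # class 0: center blob
theta = rng.uniform(0, 2 * np.pi, n)
ring = np.c_[np.cos(theta), np.sin(theta)] * rng.normal(3.0, 0.3, (n, 1))  # class 1: donut
X = np.vstack([blob, ring])                                            # rows are cases (sklearn convention)
y = np.r_[np.zeros(n), np.ones(n)]

def quad(X):   # degree-2 embedding: (x1, x2, x1^2, x2^2, x1*x2)
    return np.c_[X, X**2, (X[:, 0] * X[:, 1])[:, None]]

print(LinearDiscriminantAnalysis().fit(X, y).score(X, y))              # poor, near chance: no linear split exists
print(LinearDiscriminantAnalysis().fit(quad(X), y).score(quad(X), y))  # near 1.0 after embedding
```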

Polynomial Embedding Toy Examples (cont.)
E.g. 6: Target. Show PolyEmbed\PEtarRaw.ps
- Similar lessons. Show PolyEmbed\PEtarFLDcombine.pdf, PolyEmbed\PEtarGLRcombine.pdf
- Had hoped for better performance from the cubic

Polynomial Embedding (cont.)
Drawbacks of polynomial embedding:
- too many extra terms create spurious structure
- i.e. we have overfitting
- High Dimension Low Sample Size problems get worse

Kernel Machines
Idea: replace polynomials by other nonlinear functions.
E.g. 1: sigmoid functions from neural nets
E.g. 2: radial basis functions - Gaussian kernels
Related to kernel density estimation (smoothed histogram). Show SiZer\EGkdeCombined.pdf
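A small numpy sketch (not from the slides) of the connection mentioned here: a Gaussian radial basis function is just a Gaussian bump centered at a point, and a kernel density estimate is the average of such bumps centered at the observations.

```python
# Sketch: Gaussian radial basis function and its link to kernel density estimation.
import numpy as np

def phi(x, center, sigma):
    """Gaussian radial basis function with bandwidth (standard deviation) sigma."""
    return np.exp(-0.5 * ((x - center) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def kde(grid, data, sigma):
    """Kernel density estimate: mean of Gaussian bumps centered at the observations."""
    return np.mean([phi(grid, xi, sigma) for xi in data], axis=0)

data = np.random.default_rng(2).normal(size=50)
grid = np.linspace(-4, 4, 200)
print(kde(grid, data, sigma=0.5).shape)   # (200,): a smoothed histogram of the sample
```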

Kernel Machines (cont.)
Radial basis functions: at some grid points $g_1, \ldots, g_k$, for a bandwidth (i.e. standard deviation) $\sigma$, consider the ($d$-dimensional) functions $\varphi_\sigma(x - g_1), \ldots, \varphi_\sigma(x - g_k)$.
Replace the data matrix with:
$$\begin{pmatrix} \varphi_\sigma(X_1 - g_1) & \cdots & \varphi_\sigma(X_n - g_1) \\ \vdots & \ddots & \vdots \\ \varphi_\sigma(X_1 - g_k) & \cdots & \varphi_\sigma(X_n - g_k) \end{pmatrix}$$

Kernel Machines (cont.)
For discrimination: work in the radial basis function domain, with a new data vector $X_0$ represented by:
$$\begin{pmatrix} \varphi_\sigma(X_0 - g_1) \\ \vdots \\ \varphi_\sigma(X_0 - g_k) \end{pmatrix}$$
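A hedged numpy sketch of the radial basis function embedding on these two slides. The grid points and bandwidth below are illustrative choices, not values from the lecture; the Gaussian normalizing constant is omitted since it does not affect discrimination.

```python
# Sketch: k x n radial basis feature matrix, plus the representation of a new point X_0.
import numpy as np

def rbf_features(X, grid, sigma):
    """X: d x n data matrix (columns are cases); grid: d x k matrix of grid points.
    Returns the k x n matrix with entries phi_sigma(X_i - g_j)."""
    sq_dist = ((grid[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)    # k x n squared distances
    return np.exp(-0.5 * sq_dist / sigma**2)

rng = np.random.default_rng(3)
X = rng.normal(size=(2, 100))                                  # d = 2, n = 100
grid = np.array([[-1, -1, 1, 1], [-1, 1, -1, 1]], dtype=float) # k = 4 illustrative grid points
Phi = rbf_features(X, grid, sigma=1.0)                         # 4 x 100: embedded data matrix
x0 = rng.normal(size=(2, 1))                                   # new data vector X_0
phi0 = rbf_features(x0, grid, sigma=1.0)                       # 4 x 1: its representation for discrimination
```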

Kernel Machines (cont.)
Toy Examples:
E.g. 1: Parallel Clouds - good at the data, poor outside. Show PolyEmbed\PEodFLDe7.ps
E.g. 2: Two Clouds - similar result. Show PolyEmbed\PEtclFLDe7.ps
E.g. 3: Split X - OK at the data, strange outside. Show PolyEmbed\Ped3FLDe7.ps

Kernel Machines (cont.)
E.g. 4: Split X, parallel to Axes - similar ideas. Show PolyEmbed\Ped4FLDe7.ps
E.g. 5: Donut - mostly good (slight mistake for one kernel). Show PolyEmbed\PedonFLDe7.ps
E.g. 6: Target - much better than the other examples. Show PolyEmbed\PetarFLDe7.ps
Main lesson: generally good in regions with data, unpredictable results where data are sparse.

Kernel Machines (cont.)
E.g. 7: Checkerboard. Show PolyEmbed\PechbRaw.ps
- Kernel embedding is excellent. Show PolyEmbed\PechbFLDe7.ps
- Other polynomials lack flexibility. Show PolyEmbed\PEchbFLDcombine.pdf and PolyEmbed\PEchbGLRcombine.pdf
- Lower degree is worse

Kernel Machines (cont.)
Note: the Gaussian Likelihood Ratio had frequent numerical failures.
Important point for kernel machines: High Dimension Low Sample Size problems get worse.
This is the motivation for Support Vector Machines.

Kernel Machines (cont.)
There are generalizations of this idea to other types of analysis, and some clever computational ideas.
E.g. kernel based, nonlinear Principal Components Analysis:
Schölkopf, Smola and Müller (1998) Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation, 10, 1299-1319.
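A hedged numpy sketch of the kernel PCA idea cited here, with a Gaussian kernel as an example: form the Gram matrix $K_{ij} = k(x_i, x_j)$, double-center it, and take its top eigenvectors as the nonlinear principal component scores. This is a bare-bones illustration of the construction, not the full algorithm of the paper.

```python
# Sketch: kernel PCA scores from the double-centered Gram matrix of a Gaussian kernel.
import numpy as np

def kernel_pca_scores(X, sigma=1.0, n_components=2):
    """X: n x d data (rows are cases). Returns n x n_components kernel PC scores."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-0.5 * sq / sigma**2)                      # Gaussian kernel Gram matrix
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                                        # double-centering (centers the feature map)
    vals, vecs = np.linalg.eigh(Kc)                       # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:n_components]
    # scores are eigenvectors scaled by the square root of their eigenvalues
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

scores = kernel_pca_scores(np.random.default_rng(4).normal(size=(50, 3)))
print(scores.shape)    # (50, 2)
```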

Classical References: Support Vector Machines
Vapnik (1982) Estimation of Dependences Based on Empirical Data, Springer (Russian version, 1979).
Boser, Guyon & Vapnik (1992) in Fifth Annual Workshop on Computational Learning Theory, ACM.
Vapnik (1995) The Nature of Statistical Learning Theory, Springer.
Recommended tutorial: Burges (1998) A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery, 2, 121-167; see also the web site: http://citeseer.nj.nec.com/burges98tutorial.html

Support Vector Machines (cont.)
Motivation: High Dimension Low Sample Size discrimination (e.g. from doing a nonlinear embedding) has a tendency towards major over-fitting problems.
Toy Example:
In the 1st dimension: Class 1: $N(+2, 0.8)$, Class 2: $N(-2, 0.8)$ ($n = 20$ of each, and threw in 4 outliers).
In dimensions $2, \ldots, d$: independent $N(0, 1)$.
Show Svm\SVMeg3pd2mv.mpg
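A hedged sketch of this toy setup: the slide does not specify the outlier mechanism, and the reading of 0.8 as a standard deviation is an assumption, so the data generator below is a placeholder in those respects.

```python
# Sketch: two classes separated only in the first coordinate, padded with noise coordinates.
import numpy as np

def toy_data(d, n=20, rng=np.random.default_rng(5)):
    """Returns (X1, X2): n x d samples for Class 1 (first coord ~ N(+2, 0.8)) and
    Class 2 (first coord ~ N(-2, 0.8)); coordinates 2..d are independent N(0, 1)."""
    X1 = rng.normal(size=(n, d))
    X2 = rng.normal(size=(n, d))
    X1[:, 0] = rng.normal(+2.0, 0.8, n)
    X2[:, 0] = rng.normal(-2.0, 0.8, n)
    # placeholder "4 outliers": push two points per class far out in the first coordinate
    X1[:2, 0] += 8.0
    X2[:2, 0] -= 8.0
    return X1, X2
```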

Support Vector Machines (cont.)
Toy Example, for linear discrimination:
Top: projection onto the (2-d) subspace generated by the 1st unit vector (- -) and the discrimination direction vector (----), which shows the angle between them.
For discrimination that is reproducible (over new data sets): want these near each other, i.e. a small angle.
Bottom: 1-d projections, and smoothed histograms.

Support Vector Machines (cont.)
Lessons from the Fisher Linear Discrimination Toy Example:
- Great angle for $d = 1$, but substantial overlap
- OK angle for $d = 2, \ldots, 10$, still significant overlap
- Angle gets very bad for $d = 11, \ldots, 18$, but overlap declines
- No overlap for $d \ge 23$ (perfect discrimination!?!?)
- Completely nonreproducible (with new data)
- Thus useless for real discrimination
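A hedged sketch of the angle diagnostic used here, reusing the toy_data sketch above: the ideal discrimination direction is the first unit vector, so the angle between it and the estimated FLD direction measures how badly the extra noise dimensions corrupt the rule as $d$ grows toward the sample size.

```python
# Sketch: angle between the estimated FLD direction and the first coordinate axis.
import numpy as np

def fld_angle(X1, X2):
    """Angle (degrees) between the FLD direction and the first unit vector."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S = (np.atleast_2d(np.cov(X1, rowvar=False)) + np.atleast_2d(np.cov(X2, rowvar=False))) / 2
    w = np.linalg.solve(S + 1e-8 * np.eye(S.shape[0]), m1 - m2)
    cos = abs(w[0]) / np.linalg.norm(w)
    return np.degrees(np.arccos(np.clip(cos, 0, 1)))

for d in (1, 5, 10, 20, 35):
    print(d, round(fld_angle(*toy_data(d)), 1))   # the angle degrades as noise dimensions are added
```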

Support Vector Machines (cont.)
Main goal of Support Vector Machines: achieve a trade-off between:
Discrimination quality for the data at hand
vs.
Reproducibility with new data
Approaches:
1. Regularization (bound on generalization, via "complexity")
2. Quadratic Programming (a generalization of Linear Programming)
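A hedged sketch of this trade-off using a soft-margin linear SVM: the regularization parameter C controls how hard the fit chases the training data versus keeping a simple (large-margin) rule that should reproduce better on new data. The quadratic program behind it is solved inside scikit-learn; the data and the values of C below are illustrative.

```python
# Sketch: regularization trade-off of a soft-margin linear SVM in an HDLSS-style setting.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
d, n = 40, 20                                   # more dimensions than cases per class
X = rng.normal(size=(2 * n, d))
X[:n, 0] += 2.0                                 # classes differ only in the first coordinate
y = np.r_[np.zeros(n), np.ones(n)]
X_new = rng.normal(size=(2 * n, d))             # an independent copy, to check reproducibility
X_new[:n, 0] += 2.0
y_new = y

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.score(X, y), clf.score(X_new, y_new))   # training vs new-data accuracy
```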