Advanced Machine Learning & Perception

Advanced Machine Learning & Perception
Instructor: Tony Jebara

SVM Feature & Kernel Selection

SVM Extensions
Feature Selection (Filtering and Wrapping)
SVM Feature Selection
SVM Kernel Selection

SVM Extensions

[Grid of SVM extensions and their status: Classification, Feature/Kernel Selection, Regression, Meta/Multi-Task Learning, Transduction/Semi-supervised, Multi-Class / Structured.]

Feature Selection & Sparsity

Isolates interesting dimensions of the data for a given task. Reduces the complexity of the data. Augments sparse vectors (SVMs) with sparse dimensions. Can also improve generalization.

Example: find the subset of $d$ features out of $D$ dimensions that gives the largest-margin SVM,

$f(x) = \sum_{i=1}^{D} s_i \theta_i x_i + b$ with $s_i \in \{0,1\}$ and $\sum_{i=1}^{D} s_i = d$

Typically this needs exponential search (1000 choose 10 if we consider all possible subsets of dimensions). How do we do this efficiently (and jointly) with SVM estimation? Two classical approaches: Filtering & Wrapping.
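
As a quick sanity check on the combinatorial claim, a minimal Python sketch (math.comb is from the standard library; 1000 and 10 are the numbers quoted on the slide):

import math

# number of ways to pick d = 10 features out of D = 1000 dimensions
print(f"{math.comb(1000, 10):.3e}")  # roughly 2.63e+23 candidate subsets, far too many to enumerate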

Feature Selection: Filtering

Filtering: find/eliminate some features before even training your classifier (before induction), as a pre-processing step. Wrapping: find/eliminate some features by evaluating their accuracy after you train your classifier (after induction).

Fisher Information Criterion: compute the score below for each feature $i = 1 \ldots D$ and keep the top $d$ features:

$\mathrm{Fisher}(i) = \frac{(\mu_i^+ - \mu_i^-)^2}{(\sigma_i^+)^2 + (\sigma_i^-)^2}$ where $\mu_i^\pm = \frac{1}{T_\pm}\sum_{t \in \pm} x_{t,i}$ and $(\sigma_i^\pm)^2 = \frac{1}{T_\pm}\sum_{t \in \pm} (x_{t,i} - \mu_i^\pm)^2$

This is like putting a Gaussian on each class in each single dimension to compute their spread. The Gaussian assumption may be wrong! It only measures how linearly separable the data is.
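
A minimal numpy sketch of the Fisher filtering score, assuming labels in {-1, +1}; the small epsilon in the denominator is my addition for numerical safety, not part of the slide's formula:

import numpy as np

def fisher_scores(X, y):
    # Fisher(i) = (mu_i^+ - mu_i^-)^2 / ((sigma_i^+)^2 + (sigma_i^-)^2), computed per feature
    Xp, Xm = X[y == 1], X[y == -1]
    num = (Xp.mean(axis=0) - Xm.mean(axis=0)) ** 2
    den = Xp.var(axis=0) + Xm.var(axis=0) + 1e-12
    return num / den

def keep_top_d(X, y, d):
    # indices of the d highest-scoring features
    return np.argsort(fisher_scores(X, y))[::-1][:d]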

Feature Selection: Filtering

Pearson Correlation Coefficients: score how similar or redundant two features are. Can then remove redundancies, or remove features that are too correlated on average:

$\mathrm{Pearson}(i,j) = \frac{\sum_{t=1}^{T}(x_{t,i} - \mu_i)(x_{t,j} - \mu_j)}{T\,\sigma_i\,\sigma_j}$ (again Gaussian only)

Kolmogorov-Smirnov Test: non-parametric, more general than Gaussian, but only one feature at a time. For each feature, compute the cumulative density function over both classes and then over the single class. Find the KS score as follows and keep the top $d$ features:

$\mathrm{KolmogorovSmirnov}(i) = \sqrt{T}\,\sup_q \left| \hat{P}\{x_i \leq q\} - \hat{P}\{x_i \leq q \mid y = 1\} \right|$
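
A sketch of the Kolmogorov-Smirnov filter score in numpy; evaluating both empirical CDFs at the observed sample values is my implementation choice, since the slide only specifies the sup over q:

import numpy as np

def ks_score(x, y):
    # sqrt(T) * sup_q | P_hat{x <= q} - P_hat{x <= q | y = +1} |
    T = len(x)
    q = np.sort(x)                                      # candidate thresholds: the sample itself
    cdf_all = np.searchsorted(q, q, side="right") / T   # empirical CDF over both classes
    x_pos = np.sort(x[y == 1])
    cdf_pos = np.searchsorted(x_pos, q, side="right") / len(x_pos)
    return np.sqrt(T) * np.max(np.abs(cdf_all - cdf_pos))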

Feature Selection: Filtering

Kolmogorov-Smirnov example: [plots of the empirical density $\hat{P}(x_i)$ over all data $\{x_{1,i}, \ldots\}$ and its cumulative distribution $\hat{P}(x_i \leq q)$, alongside $\hat{P}(x_i \mid y=1)$ over $\{x_{t,i} : y_t = 1\}$ and $\hat{P}(x_i \leq q \mid y=1)$; the KS score is the largest vertical gap between the two CDFs.]

$\mathrm{KS}(i) = \sqrt{T}\,\sup_q \left| \hat{P}\{x_i \leq q\} - \hat{P}\{x_i \leq q \mid y = 1\} \right|$

Feature Selection: Wrapping

Wrapping: use the accuracy of the resulting classifier to drive the feature selection, $f(x) = w^T \phi(s .\!* x) + b$, i.e. take the elementwise product of $x$ with the binary vector $s$.

Note: more features usually improve training accuracy. So, pre-specify the maximum number (or %) of features, or optimize a generalization bound (SRM vs. ERM).

Margin & Radius Bound (like the VC bound): $E\{P_{err}\} \leq \frac{1}{T} E\left\{\frac{R^2}{M^2}\right\} = \frac{1}{T} E\left\{R^2\, W^2(\alpha)\right\}$

Better Span Bound (if the support vectors don't change when doing leave-one-out cross-validation, i.e. removing point $p$): $E\{P_{err}\} \leq \frac{1}{T} E\left\{\sum_{p=1}^{T} u\!\left(\frac{\alpha_p}{(K_{SV}^{-1})_{pp}} - 1\right)\right\}$

Expectations are over datasets, $u(\cdot)$ is the step function, and $K_{SV}$ is the Gram matrix of only the support vectors.
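
A sketch of the span-bound estimate under the slide's assumption that the support-vector set does not change during leave-one-out; K_sv must be invertible here, and alpha_sv holds the dual coefficients of the support vectors only:

import numpy as np

def span_bound(K_sv, alpha_sv, T):
    # (1/T) * sum_p u( alpha_p / (K_sv^{-1})_pp - 1 ), with u the step function
    K_inv_diag = np.diag(np.linalg.inv(K_sv))
    flips = (alpha_sv / K_inv_diag - 1.0) >= 0
    return flips.sum() / T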

SVM Feature Selection

Margin & Radius Bound: optimize via gradient descent. Assume the selection vector $s$ is given and use the masked kernel $k(x_i, x_{i'}) = k(s .\!* x_i,\ s .\!* x_{i'})$.

Compute $R^2$ and the betas via: $R^2 = \max_\beta \sum_i \beta_i k(x_i, x_i) - \sum_{i,i'} \beta_i \beta_{i'} k(x_i, x_{i'})$ s.t. $\sum_i \beta_i = 1,\ \beta_i \geq 0$

Compute $W^T W$ and the alphas via: $\max_\alpha \sum_i \alpha_i - \sum_{i,i'} \alpha_i \alpha_{i'} y_i y_{i'} k(x_i, x_{i'})$ s.t. $\alpha_i \in [0, C],\ \sum_i \alpha_i y_i = 0$, with $W^2 = \sum_{i,i'} \alpha_i \alpha_{i'} y_i y_{i'} k(x_i, x_{i'})$

Assume the switches are continuous and take derivatives of $R^2 / M^2$:

$\frac{\partial (R^2 W^2)}{\partial s} = R^2 \frac{\partial W^2}{\partial s} + W^2 \frac{\partial R^2}{\partial s}$
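
The R^2 problem is a small quadratic program over the simplex. A sketch using cvxpy (using cvxpy here is my choice, not the slide's; K must be a symmetric positive semidefinite kernel matrix for quad_form to be accepted):

import cvxpy as cp
import numpy as np

def radius_squared(K):
    # R^2 = max_beta sum_i beta_i K_ii - beta^T K beta  s.t.  sum(beta) = 1, beta >= 0
    n = K.shape[0]
    beta = cp.Variable(n, nonneg=True)
    obj = cp.Maximize(cp.sum(cp.multiply(np.diag(K), beta)) - cp.quad_form(beta, K))
    prob = cp.Problem(obj, [cp.sum(beta) == 1])
    prob.solve()
    return prob.value, beta.value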

SVM Feature Selection

Use the chain rule to get the gradient of the kernel with respect to $s$. E.g. for the RBF kernel:

$k(x, x') = \exp\!\left(-\frac{1}{2\sigma^2}\left\|x .\!* s - x' .\!* s\right\|^2\right) = \exp\!\left(-\frac{1}{2\sigma^2}\sum_{j=1}^{D} s_j^2 \left(x(j) - x'(j)\right)^2\right)$

$\frac{\partial k(x, x')}{\partial s_j} = -\frac{s_j}{\sigma^2}\left(x(j) - x'(j)\right)^2 \exp\!\left(-\frac{1}{2\sigma^2}\sum_{j'=1}^{D} s_{j'}^2 \left(x(j') - x'(j')\right)^2\right)$
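
A numpy sketch of the masked RBF kernel and its gradient with respect to the switch vector s (function and variable names are mine):

import numpy as np

def rbf_masked(x, xp, s, sigma):
    # k(x, x') = exp( -||s.*x - s.*x'||^2 / (2 sigma^2) )
    d2 = np.sum(s**2 * (x - xp)**2)
    return np.exp(-d2 / (2.0 * sigma**2))

def rbf_masked_grad_s(x, xp, s, sigma):
    # dk/ds_j = -(s_j / sigma^2) * (x(j) - x'(j))^2 * k(x, x')
    return -(s / sigma**2) * (x - xp)**2 * rbf_masked(x, xp, s, sigma)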

SVM Feature Selection

Assemble all the calculations to get the gradient vector over $s$:

$\frac{\partial R^2}{\partial s} = \sum_i \beta_i \frac{\partial k(x_i, x_i)}{\partial s} - \sum_{i,i'} \beta_i \beta_{i'} \frac{\partial k(x_i, x_{i'})}{\partial s}$, $\quad \frac{\partial W^2}{\partial s} = \sum_{i,i'} y_i y_{i'} \alpha_i \alpha_{i'} \frac{\partial k(x_i, x_{i'})}{\partial s}$

Given the old value of $s$, e.g. $s = [0\ 1\ 1\ 0\ \ldots]$, the gradient is

$\frac{\partial (R^2 W^2)}{\partial s} = R^2 \frac{\partial W^2}{\partial s} + W^2 \frac{\partial R^2}{\partial s} = 92.4 \cdot [0.4\ 0.2\ 3.2\ 2.4\ \ldots] + 25.4 \cdot [0.3\ 3.1\ 3.5\ 2.3\ \ldots]$

Take a small step against the gradient to drive the term down.
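
A sketch of the assembled gradient and the descent step, reusing rbf_masked and rbf_masked_grad_s from the sketch above; alpha and beta are assumed to come from the two quadratic programs on the earlier slide:

import numpy as np

def grad_r2w2(X, y, alpha, beta, s, sigma):
    # Assemble dR2/ds and dW2/ds from the per-pair kernel gradients, then apply the product rule.
    T = X.shape[0]
    K = np.array([[rbf_masked(X[i], X[j], s, sigma) for j in range(T)] for i in range(T)])
    dK = np.array([[rbf_masked_grad_s(X[i], X[j], s, sigma) for j in range(T)] for i in range(T)])
    R2 = beta @ np.diag(K) - beta @ K @ beta
    W2 = (alpha * y) @ K @ (alpha * y)
    dR2 = np.einsum("i,iid->d", beta, dK) - np.einsum("i,j,ijd->d", beta, beta, dK)
    dW2 = np.einsum("i,j,ijd->d", alpha * y, alpha * y, dK)
    return R2 * dW2 + W2 * dR2

# one small step against the gradient:
# s = s - step_size * grad_r2w2(X, y, alpha, beta, s, sigma)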

SVM Feature Selection

[Results on data synthesized from a mixture of Gaussians: feature selection improves the classifier and speeds it up.]

SVM Feature Selection

[Results on real face & pedestrian (wavelet) data: only a speedup. The wavelet basis is shown.]

SVM Kernel Selection

We are given $i = 1 \ldots D$ base kernels to use in an SVM: $k_1(x,x'), k_2(x,x'), \ldots, k_D(x,x')$. How do we pick the best ones, or a combination of them, e.g. $k_{FINAL}(x,x') = k_4(x,x') + k_9(x,x') + k_{12}(x,x')$?

If we only had to use one kernel, we could try $D$ different SVMs. To pick 5 out of 10 kernels, we would need 10 choose 5 = 252 SVMs! Even worse is picking a weighted combination of kernels where the $\alpha$ weights are positive: $k_{FINAL}(x,x') = \sum_{i=1}^{D} \alpha_i k_i(x,x')$.

Define the alignment between two kernel matrices as $A(K_1, K_2) = \frac{\langle K_1, K_2 \rangle}{\sqrt{\langle K_1, K_1 \rangle \langle K_2, K_2 \rangle}}$ where $\langle K_1, K_2 \rangle = \sum_{i,j=1}^{N} k_1(x_i, x_j)\, k_2(x_i, x_j)$.
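
A numpy sketch of the alignment score between two Gram matrices, using the elementwise (Frobenius) inner product defined on the slide:

import numpy as np

def alignment(K1, K2):
    # A(K1, K2) = <K1, K2> / sqrt(<K1, K1> <K2, K2>), with <K1, K2> = sum_ij K1_ij * K2_ij
    inner = np.sum(K1 * K2)
    return inner / np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))

# alignment of a candidate kernel matrix with the label matrix:
# score = alignment(K, np.outer(y, y))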

SVM Kernel Selection

We want a kernel matrix $K$ that aligns with the label matrix: $\max_K A(K, yy^T)$. This can be written equivalently as the solution of:

$\max_K \langle K, yy^T \rangle$ s.t. $\langle K, K \rangle = 1,\ K \succeq 0$

This can all be written as a semidefinite program (SDP):

$\max_{A,K} \langle K, yy^T \rangle$ s.t. $\begin{bmatrix} A & K^T & 0 & 0 \\ K & I & 0 & 0 \\ 0 & 0 & 1 - \mathrm{tr}(A) & 0 \\ 0 & 0 & 0 & K \end{bmatrix} \succeq 0$

Unfortunately, this can give a trivial solution.
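
A cvxpy sketch of the alignment maximization (cvxpy, and a Frobenius-norm inequality in place of the slide's equality constraint, are my assumptions). It also illustrates why the unconstrained problem is trivial: the optimum is just a rescaled yy^T:

import cvxpy as cp
import numpy as np

def best_aligned_kernel(y):
    # max_K <K, y y^T>  s.t.  ||K||_F <= 1,  K PSD  -- optimum is K = y y^T / ||y y^T||_F
    n = len(y)
    K = cp.Variable((n, n), PSD=True)
    obj = cp.Maximize(cp.sum(cp.multiply(K, np.outer(y, y))))
    cp.Problem(obj, [cp.norm(K, "fro") <= 1]).solve()
    return K.value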

SVM Kernel Selection

Instead, force $K$ to be a conic combination of base kernels: keep the same semidefinite block constraint as above and add

$K = \sum_{i=1}^{D} \alpha_i K_i$

to the program $\max_{A,K} \langle K, yy^T \rangle$. This is simpler than an SDP, just a second-order cone program (faster code).
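
A cvxpy sketch of the conic-combination version, restricting K to nonnegative mixtures of the base Gram matrices (the variable names and the Frobenius-norm normalization are my assumptions):

import cvxpy as cp
import numpy as np

def align_base_kernels(K_list, y):
    # max <K, y y^T>  over  K = sum_i mu_i K_i,  mu_i >= 0,  ||K||_F <= 1
    mu = cp.Variable(len(K_list), nonneg=True)
    K = sum(mu[i] * K_list[i] for i in range(len(K_list)))
    obj = cp.Maximize(cp.sum(cp.multiply(K, np.outer(y, y))))
    cp.Problem(obj, [cp.norm(K, "fro") <= 1]).solve()
    return mu.value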

Feature vs. Kernel Selection

Linear feature selection can be done via kernel selection! Take $f(x) = w^T (s .\!* x) + b$ where only a few values of $s$ are 1 and most are zero. Define the base kernels to be $k_1(x,x'), k_2(x,x'), \ldots, k_D(x,x')$ with $k_i(x, x') = x(i)\, x'(i)$, so that $K_{FINAL} = \sum_i s_i K_i$.

For example, in a linear SVM the classifier is:

$f(x) = \sum_t \alpha_t y_t\, k_{FINAL}(x_t, x) + b = \sum_t \alpha_t y_t \sum_{i=1}^{D} s_i\, k_i(x_t, x) + b = \sum_t \alpha_t y_t \sum_{i=1}^{D} s_i\, x_t(i)\, x(i) + b = w^T (s .\!* x) + b$
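
A small numpy check of the identity on this slide: for a binary switch vector s, the conic combination of the per-coordinate base kernels equals the linear kernel on the masked inputs (the toy data is made up):

import numpy as np

def k_final(s, x, xp):
    # sum_i s_i * k_i(x, x')  with base kernels  k_i(x, x') = x(i) * x'(i)
    return np.sum(s * x * xp)

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(5), rng.standard_normal(5)
s = np.array([1.0, 0.0, 1.0, 1.0, 0.0])                      # binary switches
assert np.isclose(k_final(s, x, xp), np.dot(s * x, s * xp))  # since s^2 = s for binary s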