Machine learning for pervasive systems: Classification in high-dimensional spaces


Machine learning for pervasive systems: Classification in high-dimensional spaces
Department of Communications and Networking, Aalto University, School of Electrical Engineering
stephan.sigg@aalto.fi
Version 1.1

Lecture overview
02.11.2016  Introduction
09.11.2016  Decision Trees
16.11.2016  Regression
23.11.2016  Training Principles, Performance Evaluation
30.11.2016  PCA, LSI and SVM
07.12.2016  Instance-based learning
14.12.2016  Artificial Neural Networks
21.12.2016  No lecture (ELEC Christmas party)
11.01.2017  Probabilistic Graphical Models
18.01.2017  Unsupervised learning
25.01.2017  Anomaly detection, online-learning, recommender systems
01.02.2017  Trend mining and Topic Models
08.02.2017  Feature selection
15.02.2017  Presentations

Outline: Principal Component Analysis; Latent Semantic Indexing; Support Vector Machines (Cost function, Hypothesis, Kernels)

Principal Component Analysis
Find a lower-dimensional surface onto which to project the data.

PCA finds k vectors v^{(1)}, ..., v^{(k)} onto which to project the data such that the projection error is minimized. In particular, we find values z^{(i)} that represent the x^{(i)} in the k-dimensional vector space spanned by the v^{(i)}.

1. Compute the covariance matrix from the x^{(i)}:

   C = \frac{1}{m} \sum_{i=1}^{n} \underbrace{x^{(i)}}_{m \times 1} \underbrace{(x^{(i)})^T}_{1 \times m} \in \mathbb{R}^{m \times m}

   (We assume that all features are mean-normalized and scaled into [0, 1].)

   Covariance: a measure of the spread of a set of points around their center of mass.

2. The principal components are found by computing the eigenvectors and eigenvalues of C (solving (C - \lambda I_m) u = 0).

   When a matrix C is multiplied with a vector u, this usually results in a new vector Cu of different direction than u. There are a few vectors u, however, which keep their direction (Cu = \lambda u). These are the eigenvectors of C, and the \lambda are the corresponding eigenvalues. The (orthogonal) eigenvectors, sorted by their eigenvalues, point in the directions of greatest variance in the data.

3. Choose the k eigenvectors with the largest eigenvalues to form the projection matrix U.

4. These k eigenvectors in U are used to transform the inputs x^{(i)} to z^{(i)} = U^T x^{(i)}.
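
The four steps above can be sketched with numpy as follows (a minimal sketch, assuming the rows of X already hold mean-normalised samples; the function name pca_project is illustrative, not from the slides):

```python
import numpy as np

def pca_project(X, k):
    """Project mean-normalised samples (rows of X) onto the top-k principal components."""
    n_samples = X.shape[0]
    C = (X.T @ X) / n_samples             # step 1: covariance matrix of the centred data
    eigvals, eigvecs = np.linalg.eigh(C)  # step 2: eigenvalues/eigenvectors (C is symmetric)
    order = np.argsort(eigvals)[::-1]     # sort by decreasing eigenvalue (greatest variance first)
    U = eigvecs[:, order[:k]]             # step 3: the k eigenvectors with the largest eigenvalues
    Z = X @ U                             # step 4: z^(i) = U^T x^(i), applied row-wise
    return Z, U, eigvals[order]
```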

How to choose the number k of dimensions? We can calculate the ratio

\frac{\text{average squared projection error}}{\text{total variation in the data}} = \frac{\frac{1}{m}\sum_{i=1}^{m} \| x^{(i)} - x^{(i)}_{\text{approx}} \|^2}{\frac{1}{m}\sum_{i=1}^{m} \| x^{(i)} \|^2} = 1 - \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{j=1}^{m} \lambda_j} = d

as the accuracy of the projection using k principal components, expressed as a function of the eigenvalues. We say that 100 \cdot (1 - d)\% of the variance is retained. (Typically, d \in [0.01, 0.05].)
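
A hedged sketch of this rule, using the eigenvalues returned by the previous snippet (the function name choose_k and the default d = 0.01 are illustrative):

```python
import numpy as np

def choose_k(eigvals, d=0.01):
    """Smallest k for which at most a fraction d of the variance is lost,
    i.e. 100*(1-d)% of the variance is retained."""
    eigvals = np.sort(np.asarray(eigvals))[::-1]
    retained = np.cumsum(eigvals) / np.sum(eigvals)   # (sum_{i<=k} lambda_i) / (sum_j lambda_j)
    return int(np.argmax(retained >= 1.0 - d)) + 1
```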

Outline: Principal Component Analysis; Latent Semantic Indexing; Support Vector Machines (Cost function, Hypothesis, Kernels)

Latent Semantic Indexing
Motivation: In information retrieval, a common task is to obtain, from many documents, the subset that best matches a query. Typical feature representations of documents are term-document matrices. These matrices are typically huge but sparse. How can we identify those feature dimensions (or combinations thereof) which are most meaningful?

Singular Value Decomposition
Any m \times n matrix C can be represented by a singular value decomposition of the form C = U \Sigma V^T, where

U: m \times m matrix whose columns are the orthogonal eigenvectors of C C^T
V: n \times n matrix whose columns are the orthogonal eigenvectors of C^T C
\Sigma: diagonal matrix with \Sigma_{ii} = \sqrt{\lambda_i} (the singular values) and \Sigma_{ij} = 0 for i \neq j

The first k eigenvectors map document vectors to a lower-dimensional representation. It can be shown that this mapping results in the k-dimensional space with the smallest distance to the original space.
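
A numpy sketch of the decomposition and its rank-k truncation (the toy matrix C below is an illustrative stand-in for a real term-document matrix):

```python
import numpy as np

# toy term-document matrix C (terms x documents), illustrative counts only
C = np.array([[1., 0., 1., 0., 0.],
              [0., 1., 0., 0., 0.],
              [1., 1., 0., 0., 0.],
              [1., 0., 0., 1., 1.],
              [0., 0., 0., 1., 0.]])

U, s, Vt = np.linalg.svd(C, full_matrices=False)   # C = U @ diag(s) @ Vt

k = 2
C_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # best rank-k approximation of C
doc_coords = np.diag(s[:k]) @ Vt[:k, :]            # k-dimensional representation of each document
```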

Example (shown on the slides): a small term-document matrix C is decomposed into U, \Sigma and V^T; keeping only the largest singular values yields a rank-2 approximation C_2, and documents are then compared via cosine similarity in the reduced space.
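
Cosine similarity in the reduced space could then be computed along these lines (a sketch; doc_coords refers to the previous snippet):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two vectors of the reduced document space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# e.g. similarity of documents 0 and 1 in the k-dimensional LSI space:
# cosine_sim(doc_coords[:, 0], doc_coords[:, 1])
```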


Outline: Principal Component Analysis; Latent Semantic Indexing; Support Vector Machines (Cost function, Hypothesis, Kernels)

For our previous classifiers, we have designed an objective function of sufficient dimension. An alternative to designing complex non-linear functions is to change the dimension of the input space so that a linear separator is again possible.


An SVM pre-processes the data to represent patterns in a high-dimensional space, often of much higher dimension than the original feature space. A hyperplane is then inserted in order to separate the data.

The goal for support vector machines is to find a separating hyperplane with the largest margin to the nearest points of each class. If no such hyperplane exists, the points are mapped into a higher-dimensional space until such a plane exists.

Simple extension to several classes by an iterative one-versus-all approach: does a sample belong to class 1 or not? to class 2 or not? ... The search for the optimal mapping between input space and feature space is complicated (no optimal approach is known); a sketch of the one-versus-all scheme follows below.
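
A small scikit-learn sketch of the one-versus-all scheme (the three-class data set is synthetic and purely illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = make_blobs(n_samples=150, centers=3, random_state=0)   # three classes

# one binary SVM per class: "belongs to class c or not?"
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
print(ovr.predict(X[:5]))
```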

Outline: Principal Component Analysis; Latent Semantic Indexing; Support Vector Machines (Cost function, Hypothesis, Kernels)

Cost function
Contribution of a single sample to the overall cost:

Logistic regression:
-y \log \frac{1}{1+e^{-W^T x}} - (1-y) \log\left(1 - \frac{1}{1+e^{-W^T x}}\right)

SVM:
y \cdot \text{cost}_{y=1}(W^T x) + (1-y) \cdot \text{cost}_{y=0}(W^T x)

The complete optimisation problems are then:

Logistic regression:
\min_W \frac{1}{m} \sum_{i=1}^{m} \left[ -y_i \log \frac{1}{1+e^{-W^T x_i}} - (1-y_i) \log\left(1 - \frac{1}{1+e^{-W^T x_i}}\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2

SVM:
\min_W \; C \sum_{i=1}^{m} \left[ y_i \, \text{cost}_{y=1}(W^T x_i) + (1-y_i) \, \text{cost}_{y=0}(W^T x_i) \right] + \frac{1}{2} \sum_{j=1}^{n} w_j^2

C here plays a similar role as \frac{1}{\lambda}.
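
A numpy sketch of the SVM objective above, assuming the usual hinge-shaped per-sample costs cost_{y=1}(z) = max(0, 1 - z) and cost_{y=0}(z) = max(0, 1 + z) (an assumption; the slides only show their plots) and labels y_i in {0, 1}:

```python
import numpy as np

def svm_objective(W, X, y, C=1.0):
    """C * sum_i [ y_i cost_{y=1}(W^T x_i) + (1 - y_i) cost_{y=0}(W^T x_i) ] + 0.5 * sum_j w_j^2."""
    z = X @ W                              # W^T x_i for every sample (rows of X)
    cost1 = np.maximum(0.0, 1.0 - z)       # cost_{y=1}: zero once W^T x_i >= 1  (assumed hinge shape)
    cost0 = np.maximum(0.0, 1.0 + z)       # cost_{y=0}: zero once W^T x_i <= -1 (assumed hinge shape)
    return C * np.sum(y * cost1 + (1 - y) * cost0) + 0.5 * np.sum(W ** 2)
```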

Outline: Principal Component Analysis; Latent Semantic Indexing; Support Vector Machines (Cost function, Hypothesis, Kernels)

SVM hypothesis

\min_W \; C \sum_{i=1}^{m} \left[ y_i \, \text{cost}_{y=1}(W^T x_i) + (1-y_i) \, \text{cost}_{y=0}(W^T x_i) \right] + \frac{1}{2} \sum_{j=1}^{n} w_j^2

For classification, W^T x \geq 0 (predict 1) vs. W^T x < 0 (predict 0) would be sufficient. The SVM cost instead pushes for W^T x \geq 1 or W^T x \leq -1, i.e. for an added margin of confidence.

Outliers lead to an elastic decision boundary: a large C enforces a stricter boundary at the cost of a smaller margin, while a small C tolerates outliers.
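
In practice this trade-off is exposed as the C parameter of off-the-shelf SVM implementations; a small scikit-learn sketch on synthetic data:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)

strict = SVC(kernel="linear", C=100.0).fit(X, y)  # large C: stricter boundary, smaller margin
soft = SVC(kernel="linear", C=0.01).fit(X, y)     # small C: tolerates outliers

# the tolerant model typically relies on more support vectors
print(len(strict.support_), len(soft.support_))
```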

Large margin classifier

\min_W \; C \sum_{i=1}^{m} \left[ y_i \, \text{cost}_{y=1}(W^T x_i) + (1-y_i) \, \text{cost}_{y=0}(W^T x_i) \right] + \frac{1}{2} \sum_{j=1}^{n} w_j^2

Rewrite the SVM optimisation problem as

\min_W \; \frac{1}{2} \sum_{j=1}^{n} w_j^2 \quad \text{s.t.} \quad W^T x_i \geq 1 \text{ if } y_i = 1, \qquad W^T x_i \leq -1 \text{ if } y_i = 0

Note that \frac{1}{2} \sum_{j=1}^{n} w_j^2 = \frac{1}{2}\left(w_1^2 + \dots + w_n^2\right) = \frac{1}{2} \|W\|^2.

In the two-dimensional illustration, W^T x = w_1 x_1 + w_2 x_2 = \|W\| \cdot p, where p is the (signed) length of the projection of x onto W. The constraints can therefore be written as \|W\| \cdot p_i \geq 1 if y_i = 1 and \|W\| \cdot p_i \leq -1 if y_i = 0.

Which decision boundary is found? With h(x) = w_1 x_1 + w_2 x_2, the weight vector W is orthogonal to all x with h(x) = 0, i.e. to the decision boundary itself. Minimising \frac{1}{2}\|W\|^2 while satisfying \|W\| \cdot p_i \geq 1 necessitates large projections p_i, so the boundary with the largest margin is chosen.
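
Since the constraints put the closest points on W^T x = +1 and W^T x = -1, the margin width equals 2 / ||W||; a short scikit-learn check of this relation on synthetic, well-separated data:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.8, random_state=1)
clf = SVC(kernel="linear", C=1000.0).fit(X, y)   # large C approximates a hard margin

w = clf.coef_[0]
margin_width = 2.0 / np.linalg.norm(w)           # distance between W^T x = +1 and W^T x = -1
print(margin_width)
```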

Outline: Principal Component Analysis; Latent Semantic Indexing; Support Vector Machines (Cost function, Hypothesis, Kernels)

Kernels

Non-linear decision boundary: predict 1 if
w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + \dots \geq 0

Instead of designing such polynomial terms by hand, use kernel features:
w_0 + w_1 k_1 + w_2 k_2 + w_3 k_3 + \dots

Kernel: define the kernel via landmarks l_i.

Gaussian kernel: k(x, l_i) = e^{-\frac{\|x - l_i\|^2}{2\sigma^2}}

For x \approx l_i we have k(x, l_i) \approx 1 (and towards 0 otherwise); \sigma controls the width of the Gaussian.

Example: w_0 = -0.5, w_1 = 1, w_2 = 1, w_3 = 0 with
h(x) = w_0 + w_1 k(x, l_1) + w_2 k(x, l_2) + w_3 k(x, l_3)
The slides illustrate the resulting decision regions for \sigma = 1, \sigma = 0.5 and \sigma = 2.
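
A sketch of the Gaussian kernel and of the example hypothesis above (the landmark coordinates l_1, l_2, l_3 are made-up placeholders, and w_0 = -0.5 as assumed above):

```python
import numpy as np

def gaussian_kernel(x, l, sigma=1.0):
    """k(x, l) = exp(-||x - l||^2 / (2 sigma^2)): close to 1 near the landmark, towards 0 far away."""
    return float(np.exp(-np.sum((np.asarray(x) - np.asarray(l)) ** 2) / (2.0 * sigma ** 2)))

l1, l2, l3 = np.array([3.0, 5.0]), np.array([3.0, 2.0]), np.array([1.0, 1.0])  # placeholder landmarks
w0, w1, w2, w3 = -0.5, 1.0, 1.0, 0.0

def h(x, sigma=1.0):
    return w0 + w1 * gaussian_kernel(x, l1, sigma) + w2 * gaussian_kernel(x, l2, sigma) + w3 * gaussian_kernel(x, l3, sigma)

print(h([3.0, 5.0]) >= 0)   # near l1: predict 1
print(h([9.0, 9.0]) >= 0)   # far from every landmark: predict 0
```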

Placement of landmarks

A possible choice of initial landmarks: all training-set samples. For the training of the weights w_j, each sample x_i is mapped to the feature vector

f_i = \left( k(x_i, l_1), \dots, k(x_i, l_m) \right)^T

and we solve

\min_W \; C \sum_{i=1}^{m} \left[ y_i \, \text{cost}_{y=1}(W^T f_i) + (1-y_i) \, \text{cost}_{y=0}(W^T f_i) \right] + \frac{1}{2} \sum_{j=1}^{m} w_j^2
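
A sketch of this feature construction with every training sample used as a landmark (the function name feature_map is illustrative; the resulting matrix F would feed the SVM cost shown above):

```python
import numpy as np

def feature_map(X, landmarks, sigma=1.0):
    """Row i holds f_i = (k(x_i, l_1), ..., k(x_i, l_m)) for the Gaussian kernel."""
    diff = X[:, None, :] - landmarks[None, :, :]   # pairwise differences x_i - l_j
    sq_dist = np.sum(diff ** 2, axis=2)            # ||x_i - l_j||^2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# F = feature_map(X, X)   # landmarks = all training samples -> an m x m feature matrix
```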

Outline: Principal Component Analysis; Latent Semantic Indexing; Support Vector Machines (Cost function, Hypothesis, Kernels)

Questions? stephan.sigg@aalto.fi / Le Nguyen: le.ngu.nguyen@aalto.fi

Literature
C.M. Bishop: Pattern Recognition and Machine Learning, Springer, 2007.
R.O. Duda, P.E. Hart, D.G. Stork: Pattern Classification, Wiley, 2001.