Machine Learning 2nd Edition


Lecture Slides for INTRODUCTION TO MACHINE LEARNING, 2nd Edition, Ethem Alpaydın, The MIT Press, 2010 (alpaydin@boun.edu.tr, http://www.cmpe.boun.edu.tr/~ethem/i2ml2e); modified by Leonardo Bobadilla, with some parts from http://www.cs.tau.ac.il/~apartzin/machinelearning/

Outline. Previous class: Multivariate Data; Parameter Estimation; Estimation of Missing Values; Multivariate Classification. This class: Ch 5: Multivariate Methods (Discrete Features, Multivariate Regression); Ch 6: Dimensionality Reduction. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e The MIT Press (V1.0)

CHAPTER 5: Multivariate Methods

Model Selection 4 Based on E Alpaydın 2004 Introduction to Machine Learning The MIT Press (V1.1)

Discrete Features Binary features: p_ij ≡ p(x_j = 1 | C_i). If the x_j are independent (Naive Bayes), the discriminant is linear: g_i(x) = log p(x | C_i) + log P(C_i) = Σ_j [x_j log p_ij + (1 − x_j) log(1 − p_ij)] + log P(C_i). Estimated parameters: p̂_ij = Σ_t x_j^t r_i^t / Σ_t r_i^t. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e The MIT Press (V1.0) 5
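This discriminant fits in a few lines of numpy. The sketch below is illustrative, not the book's code: it assumes a binary data matrix X of shape (N, d) and one-hot class indicators R of shape (N, K), both names chosen for this example.

import numpy as np

def fit_bernoulli_nb(X, R, eps=1e-9):
    # p_ij = P(x_j = 1 | C_i) estimated as sum_t x_j^t r_i^t / sum_t r_i^t
    counts = R.sum(axis=0)                       # number of samples per class
    p = (R.T @ X) / counts[:, None]              # shape (K, d)
    priors = counts / X.shape[0]                 # P(C_i)
    return np.clip(p, eps, 1 - eps), priors

def discriminant(X, p, priors):
    # g_i(x) = sum_j [x_j log p_ij + (1 - x_j) log(1 - p_ij)] + log P(C_i)
    return X @ np.log(p).T + (1 - X) @ np.log(1 - p).T + np.log(priors)

# usage: predicted = discriminant(X, *fit_bernoulli_nb(X, R)).argmax(axis=1)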

Multivariate Regression r^t = g(x^t | w_0, w_1, ..., w_d) + ε. Multivariate linear model: g(x^t | w_0, w_1, ..., w_d) = w_0 + w_1 x_1^t + w_2 x_2^t + ... + w_d x_d^t. Error: E(w_0, w_1, ..., w_d | X) = (1/2) Σ_t [r^t − (w_0 + w_1 x_1^t + ... + w_d x_d^t)]^2. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e The MIT Press (V1.0) 6

Multivariate Regression Minimize E(w_0, w_1, ..., w_d | X) = (1/2) Σ_t [r^t − (w_0 + w_1 x_1^t + w_2 x_2^t + ... + w_d x_d^t)]^2 with respect to w_0, w_1, ..., w_d; as in the univariate case, setting the derivatives to zero gives a set of linear (normal) equations. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e The MIT Press (V1.0) 7
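As a concrete illustration of minimizing this error, the sketch below fits the multivariate linear model by least squares with numpy; the names X (an N x d data matrix) and r (the N targets) are assumptions made for the example, not notation from the slides.

import numpy as np

def fit_multivariate_linear(X, r):
    # minimize (1/2) sum_t [r^t - (w_0 + w_1 x_1^t + ... + w_d x_d^t)]^2
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a column of 1s for w_0
    w, *_ = np.linalg.lstsq(X1, r, rcond=None)      # returns [w_0, w_1, ..., w_d]
    return w

def predict(X, w):
    return w[0] + X @ w[1:]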

CHAPTER 6: Dimensionality Reduction

Dimensionality of input 9 Number of observables (e.g. age and income). If the number of observables is increased: more time to compute; more memory to store inputs and intermediate results; more complicated explanations (knowledge extracted by learning), e.g. a regression with 100 vs. 2 parameters; no simple visualization (a 2D vs. a 10D graph); much more data is needed (curse of dimensionality): 1M one-dimensional inputs are not equivalent to 1 input of dimension 1M.

Dimensionality reduction 10 Some features (dimensions) bear little or no useful information (e.g. hair color when selecting a car), so we can drop some features; we have to estimate from data which features can be dropped. Several features can be combined without loss, or even with a gain, of information (e.g. the incomes of all family members for a loan application), so some features can be combined; we have to estimate from data which features to combine.

Feature Selection vs Extraction 11 Feature selection: choose k < d important features, ignoring the remaining d − k (subset selection algorithms). Feature extraction: project the original x_i, i = 1,...,d dimensions onto new k < d dimensions z_j, j = 1,...,k: Principal Components Analysis (PCA), Factor Analysis (FA), Linear Discriminant Analysis (LDA). Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning The MIT Press (V1.1)

Usage 12 Have data of dimension d. Reduce the dimensionality to k < d by discarding unimportant features or combining several features into one. Use the resulting k-dimensional data set for learning a classification problem (e.g. the parameters of the probabilities P(x|C_i)) or a regression problem (e.g. the parameters of a model y = g(x|θ)). Based on E Alpaydın 2004 Introduction to Machine Learning The MIT Press (V1.1)

Goodness of feature set 13 Supervised: train using the selected subset and estimate the error on a validation data set. Unsupervised: look at the inputs only (e.g. age, income and savings) and select the subset of 2 features that bears most of the information about the person. Based on E Alpaydın 2004 Introduction to Machine Learning The MIT Press (V1.1)

Mutual Information 14 Suppose we have 3 random variables (features) X, Y, Z and have to select the 2 that give the most information. If X and Y are correlated, then much of the information about Y is already in X, so it makes sense to select features which are uncorrelated. Based on E Alpaydın 2004 Introduction to Machine Learning The MIT Press (V1.1)
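As a tiny illustration of this idea (and only a stand-in for a full mutual-information criterion), the sketch below picks the two least-correlated columns of a data matrix using sample correlations; the function name and setup are made up for this example.

import numpy as np

def least_correlated_pair(X):
    # pairwise absolute correlations between the columns (features) of X
    C = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(C, np.inf)                    # ignore self-correlation
    i, j = np.unravel_index(np.argmin(C), C.shape)
    return i, j

# usage with 3 features X, Y, Z given as 1-D arrays:
# i, j = least_correlated_pair(np.column_stack([X, Y, Z]))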

Subset Selection There are 2^d possible subsets of d features. Forward search: add the best feature at each step. The set of features F is initially empty. At each iteration, find the best new feature, j = argmin_i E(F ∪ x_i), and add x_j to F if E(F ∪ x_j) < E(F). This is an O(d^2) algorithm and does not guarantee the optimal subset. Backward search: start with all features and remove one at a time, if possible. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e The MIT Press (V1.0) 15

Subset Selection Backward search: start with all the features and remove one at a time, if possible. The set of features F initially contains all features. At each iteration, find the feature whose removal hurts least, j = argmin_i E(F − x_i), and remove x_j from F if E(F − x_j) < E(F). This is an O(d^2) algorithm and does not guarantee the optimal subset. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e The MIT Press (V1.0) 16

Subset-selection (recap) 17 Forward search: start from an empty set of features; try each of the remaining features and estimate the classification/regression error of adding that specific feature; select the feature that gives the maximum improvement in validation error; stop when there is no significant improvement. Backward search: start with the original set of size d and drop the features with the smallest impact on the error. Based on E Alpaydın 2004 Introduction to Machine Learning The MIT Press (V1.1)
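A minimal sketch of forward selection as just described. It assumes a helper evaluate_error(features) that trains on the training set with the given feature subset and returns the validation error; that helper (and the function name) is hypothetical, not from the slides.

def forward_selection(d, evaluate_error):
    # greedy forward search: grow F one feature at a time while validation error improves
    F = []                                        # selected feature indices
    best_err = evaluate_error(F)                  # error of the baseline (empty) model
    remaining = set(range(d))
    while remaining:
        errs = {j: evaluate_error(F + [j]) for j in remaining}
        j = min(errs, key=errs.get)               # best new feature at this step
        if errs[j] >= best_err:                   # no improvement: stop
            break
        F.append(j)
        best_err = errs[j]
        remaining.remove(j)
    return F, best_err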

Feature Extraction 18 Face recognition problem: the training data are pairs of image + label (name), the classifier input is an image, and the classifier output is a label (name). An image is a matrix of 256x256 = 65536 values in the range 0..255. Each pixel bears little information, so we cannot simply select the 100 best ones; however, the average of the pixels around specific positions may give an indication about, say, eye color. Based on E Alpaydın 2004 Introduction to Machine Learning The MIT Press (V1.1)

Projection 19 Find a projection matrix W from d-dimensional to k-dimensional vectors that keeps the error low. Based on E Alpaydın 2004 Introduction to Machine Learning The MIT Press (V1.1)

PCA: Motivation 20 Assume that the d observables are linear combinations of k < d vectors. We would like to work with this basis, since it has lower dimension and carries (almost) all of the required information. What we expect from such a basis: components that are uncorrelated (otherwise it can be reduced further) and that have large variance (otherwise they bear no information).

PCA: Motivation 21 Based on E Alpaydın 2004 Introduction to Machine Learning The MIT Press (V1.1)

PCA: Motivation 22 Choose directions such that the total variance of the data is maximal (maximize total variance). Choose directions that are orthogonal (minimize correlation). That is, choose the k < d orthogonal directions which maximize the total variance. Based on E Alpaydın 2004 Introduction to Machine Learning The MIT Press (V1.1)

PCA 23 Choosing the directions: maximize the variance w^T Σ w subject to the constraint ||w|| = 1, using a Lagrange multiplier. Taking the derivative of w^T Σ w − α(w^T w − 1) and setting it to zero gives Σ w = α w, i.e. w is an eigenvector of the covariance matrix Σ and the variance along w equals the eigenvalue α. Since we want to maximize the variance, we choose the eigenvector with the largest eigenvalue. Based on E Alpaydın 2004 Introduction to Machine Learning The MIT Press (V1.1)

PCA 24 In a d-dimensional feature space, the d x d symmetric covariance matrix is estimated from the samples. Select the k largest eigenvalues of the covariance matrix and the associated k eigenvectors. The first eigenvector is the direction with the largest variance. Based on E Alpaydın 2004 Introduction to Machine Learning The MIT Press (V1.1)

What PCA does 25 z = W^T(x − m), where the columns of W are the eigenvectors of Σ (the estimated covariance matrix) and m is the sample mean. Centers the data at the origin and rotates the axes. Based on E Alpaydın 2004 Introduction to Machine Learning The MIT Press (V1.1)
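The transformation can be sketched in a few lines of numpy, following the recipe above (estimate the covariance, take the k leading eigenvectors, project the centered data); this is an illustrative sketch rather than the book's code, and X is assumed to hold one sample per row.

import numpy as np

def pca(X, k):
    m = X.mean(axis=0)                        # sample mean
    S = np.cov(X, rowvar=False)               # d x d sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)      # eigh: S is symmetric
    order = np.argsort(eigvals)[::-1]         # eigenvalues in descending order
    W = eigvecs[:, order[:k]]                 # d x k matrix of leading eigenvectors
    Z = (X - m) @ W                           # z = W^T (x - m) for every sample
    return Z, W, m, eigvals[order]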

How to choose k? 26 Proportion of Variance (PoV) explained: PoV = (λ_1 + λ_2 + ... + λ_k) / (λ_1 + λ_2 + ... + λ_k + ... + λ_d), where the λ_i are sorted in descending order. Typically, stop at PoV > 0.9. Scree graph: plot PoV vs k and stop at the elbow. Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning The MIT Press (V1.1)
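Continuing the pca sketch above, a small helper (illustrative, with the 0.9 threshold being the slide's typical value) that chooses k by this PoV rule from the eigenvalues:

import numpy as np

def choose_k(eigvals, threshold=0.9):
    # smallest k whose leading eigenvalues explain at least `threshold` of the total variance
    eigvals = np.sort(eigvals)[::-1]           # descending order
    pov = np.cumsum(eigvals) / eigvals.sum()   # PoV for k = 1, ..., d
    return int(np.searchsorted(pov, threshold) + 1)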


PCA 29 Can take the classes into account: Karhunen-Loève expansion: estimate the covariance per class and take the average weighted by the priors. Common Principal Components: assume all classes have the same eigenvectors (directions) but different variances. Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning The MIT Press (V1.1)

PCA 30 PCA is unsupervised (it does not take class information into account). It does not try to explain noise: large noise can become a new dimension or the largest principal component. We are interested in the resulting uncorrelated variables which explain a large portion of the total sample variance. Sometimes we are instead interested in the explained shared variance (common factors) that affects the data.

Factor Analysis 31 Assume a set of unobservable ("latent") variables. Goal: characterize the dependency among the observables using the latent variables. Suppose a group of variables has large correlation among themselves and small correlation with the other variables: a single underlying factor? Based on E Alpaydın 2004 Introduction to Machine Learning The MIT Press (V1.1)

Factor Analysis 32 Assume k input factors (latent, unobservable variables) generating the d observables. Assume all variation in the observable variables is due to the latent factors or to noise (of unknown variance). Find the transformation from the unobservables to the observables which explains the data. Based on E Alpaydın 2004 Introduction to Machine Learning The MIT Press (V1.1)

Factor Analysis 33 Find a small number of factors z which, when combined, generate x: x_i − µ_i = v_i1 z_1 + v_i2 z_2 + ... + v_ik z_k + ε_i, where the z_j, j = 1,...,k are the latent factors with E[z_j] = 0, Var(z_j) = 1, Cov(z_i, z_j) = 0 for i ≠ j; the ε_i are zero-mean noise sources with Var(ε_i) = ψ_i, Cov(ε_i, ε_j) = 0 for i ≠ j, Cov(ε_i, z_j) = 0; and the v_ij are the factor loadings. Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning The MIT Press (V1.1)
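To make the direction of this model concrete (the factors generate the observables), here is a small numpy sketch that samples data from it; every dimension and parameter value below is made up purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
d, k, N = 5, 2, 10000                    # observables, latent factors, samples (illustrative)
V = rng.normal(size=(d, k))              # factor loadings v_ij
mu = rng.normal(size=d)                  # means mu_i
psi = rng.uniform(0.1, 0.5, size=d)      # noise variances psi_i

Z = rng.normal(size=(N, k))                    # latent factors: zero mean, unit variance, uncorrelated
eps = rng.normal(size=(N, d)) * np.sqrt(psi)   # noise: zero mean, variance psi_i
X = mu + Z @ V.T + eps                         # x - mu = V z + eps, one sample per row

# the implied covariance of x is V V^T + diag(psi); compare with the sample estimate
print(np.abs(np.cov(X, rowvar=False) - (V @ V.T + np.diag(psi))).max())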

PCA vs FA PCA: from x to z, z = W^T(x − µ). FA: from z to x, x − µ = Vz + ε. Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e The MIT Press (V1.0) 34

Factor Analysis 35 In FA, factors zj are stretched, rotated and translated to generate x Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning The MIT Press (V1.1)

FA Usage 36 Speech is a function of the positions of a small number of articulators (lungs, lips, tongue). Factor analysis: go from the signal space (4000 points for 500 ms) to the articulation space (20 points). Classify speech (assign a text label) from the 20 points. Speech compression: send only the 20 values.

Linear Discriminant Analysis 37 Find a low-dimensional space such that when x is projected, classes are well-separated Based on E Alpaydın 2004 Introduction to Machine Learning The MIT Press (V1.1)

Means and Scatter after projection 38 Based on E Alpaydın 2004 Introduction to Machine Learning The MIT Press (V1.1)

Good Projection 39 Means should be as far apart as possible; scatter should be as small as possible. Fisher Linear Discriminant: J(w) = (m_1 − m_2)^2 / (s_1^2 + s_2^2). Based on E Alpaydın 2004 Introduction to Machine Learning The MIT Press (V1.1)
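For two classes, the w maximizing J(w) has a standard closed form, w proportional to S_W^{-1}(m_1 − m_2) with S_W the within-class scatter; the numpy sketch below uses that result, assuming X1 and X2 hold the samples of the two classes as rows (illustrative names, not from the slides).

import numpy as np

def fisher_direction(X1, X2):
    # direction w maximizing J(w) = (m_1 - m_2)^2 / (s_1^2 + s_2^2) after projection
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)              # within-class scatter of class 1
    S2 = (X2 - m2).T @ (X2 - m2)              # within-class scatter of class 2
    w = np.linalg.solve(S1 + S2, m1 - m2)     # w proportional to S_W^{-1} (m_1 - m_2)
    return w / np.linalg.norm(w)

# usage: project with X @ w and separate the classes with a threshold on the 1-D values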


Summary 41 Feature selection. Supervised: drop features which do not introduce large errors (using a validation set). Unsupervised: keep only uncorrelated features (drop features that do not add much information). Feature extraction: linearly combine the features into a smaller set of features. Unsupervised: PCA (explains most of the total variability), FA (explains most of the common variability). Supervised: LDA (best separates class instances). Based on E Alpaydın 2004 Introduction to Machine Learning The MIT Press (V1.1)