T-61.5060 Algorithmic methods for data mining. Slide set 6: dimensionality reduction

reading assignment
- LRU book: 11.1-11.3
- PCA tutorial in MyCourses (optional)
- optional:
  - An Elementary Proof of a Theorem of Johnson and Lindenstrauss, Dasgupta and Gupta
  - Database-friendly random projections: Johnson-Lindenstrauss with binary coins, Achlioptas
  - Random projection in dimensionality reduction: Applications to image and text data, Bingham and Mannila

the curse of dimensionality
- the efficiency of many algorithms depends on the number of dimensions d
- distance / similarity computations are at least linear in the number of dimensions
- index structures fail as the dimensionality of the data increases
- data in high dimensions is difficult to visualize

what if we were able to...
...reduce the dimensionality of the data, while maintaining the meaningfulness of the data?

dimensionality reduction
- consider a dataset X consisting of n points in a d-dimensional space
- a data point x in X is a vector in R^d
- the data can be seen as an n x d matrix
  X = ( x_11 ... x_1d
        ...
        x_n1 ... x_nd )
- dimensionality-reduction methods:
  - dimension selection: choose a subset of the existing dimensions
  - dimension composition: create new dimensions by combining existing ones

dimensionality reduction
- dimensionality-reduction methods:
  - dimension selection: choose a subset of the existing dimensions
  - dimension composition: create new dimensions by combining existing ones
- both methodologies map each vector x in R^d to a vector y in R^k
- mapping: A : R^d -> R^k
- for the idea to be useful we want k << d

linear dimensionality reduction
- dimensionality-reduction mapping: A : R^d -> R^k
- assume that A is a linear mapping; it can then be seen as a (d x k) matrix
- y = x A, so Y = X A
- objective: Y should be as close as possible to X, i.e., preserve the structure of X

closeness: pairwise distances
- Johnson-Lindenstrauss lemma: consider a dataset X of n points in R^d, and ε > 0;
  then there exists k = O(ε^-2 log n) and a linear mapping A : R^d -> R^k such that for all x and z in X
  (1-ε) ||x-z||^2 <= (d/k) ||xA-zA||^2 <= (1+ε) ||x-z||^2
- what is the intuitive interpretation of this statement?

Johnson-Lindenstrauss lemma: intuition
- each vector x in X is projected onto a k-dimensional vector y = xA
- the dimension of the projected space is k = O(ε^-2 log n)
- the squared distance ||x-z||^2 is approximated by (d/k) ||xA-zA||^2
- intuition: the expected squared norm of the projection of a unit vector onto a random k-dimensional subspace is k/d
- the probability that it deviates from its expectation is very small

the random projections
- each vector x in X is projected onto a k-dimensional vector y = xA
- random projections are represented by a linear transformation matrix A: y = x A
- what is the matrix A?

the random projections
- the elements A(i,j) of A can be drawn from the normal distribution N(0,1)
- the resulting rows of A define random directions in R^d
- another way to define A is ([Achlioptas 2003])
  A(i,j) = sqrt(3) * ( +1 with prob. 1/6, 0 with prob. 2/3, -1 with prob. 1/6 )
- why is this useful?
- all zero-mean, unit-variance distributions for A(i,j) would give a mapping that satisfies the Johnson-Lindenstrauss lemma
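
A minimal numpy sketch (illustrative, not part of the original slides; the data, sizes, and function names are made up) of the two choices of A above, checking how well pairwise squared distances are preserved. With zero-mean, unit-variance entries, E||xA||^2 = k ||x||^2, so dividing the projection by sqrt(k) plays the role of the (d/k) factor in the lemma.

import numpy as np

rng = np.random.default_rng(0)
n, d, k = 1000, 500, 50                 # number of points, original and reduced dimensionality
X = rng.normal(size=(n, d))             # toy data

# Gaussian projection matrix: A(i,j) ~ N(0,1)
A_gauss = rng.normal(size=(d, k))

# sparse Achlioptas matrix: sqrt(3) * (+1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6)
A_sparse = np.sqrt(3) * rng.choice([1.0, 0.0, -1.0], size=(d, k), p=[1/6, 2/3, 1/6])

def distance_ratios(X, A, k):
    """Ratio of projected to original squared pairwise distances (should be close to 1)."""
    Y = (X @ A) / np.sqrt(k)            # scale so that squared norms are preserved in expectation
    i, j = np.arange(200), np.arange(200, 400)
    orig = np.sum((X[i] - X[j]) ** 2, axis=1)
    proj = np.sum((Y[i] - Y[j]) ** 2, axis=1)
    return proj / orig

print(distance_ratios(X, A_gauss, k).mean(), distance_ratios(X, A_sparse, k).mean())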

datasets as matrices
- consider a dataset in the form of an n x d matrix X
- n objects as rows, d dimensions as features
- X(i,j) represents the importance of feature j for object i
- goal: understand the structure of the data, e.g., the underlying process that generates the data, and reduce the number of features representing the data

motivating examples
- find a subset of products that characterize customers
- find a subset of groups that characterize users of a social network
- find a subset of terms that accurately clusters documents

principal component analysis
- idea: look for a direction such that the data projected onto it has maximal variance
- when found, continue by seeking the next direction, which is orthogonal to this one (i.e., uncorrelated), and which explains as much of the remaining variance in the data as possible
- thus, we are seeking linear combinations of the original variables
- if we are lucky, we can find a few such linear combinations, or directions, or (principal) components, which describe the data accurately
- the aim is to capture the intrinsic variability in the data

principal component analysis
(figure: a two-dimensional data cloud with the 1st principal component and the 2nd principal component drawn)

principal component analysis
- consider X to be the n x d data matrix
- assume that X is zero-centered (each column sums to 0)
- let w define the projection we are looking for (a d x 1 vector; we require w^T w = 1)
- we choose w so that the projection of the data on w maximizes the variance
- the projection of a data point x on w is x w
- the projection of the data X on w is Xw

zero-centered data
(figure: a data cloud translated so that its mean lies at the origin 0)

principal component analysis
- the projection of the data X on w is Xw
- variance: Var(w) = (Xw)^T (Xw) = w^T X^T X w = w^T C w, where C = X^T X is (up to scaling) the covariance matrix of the data
- maximize w^T C w subject to the constraint w^T w = 1
- maximize f = w^T C w - λ(w^T w - 1), where λ is the Lagrange multiplier

principal component analysis
- optimization problem: maximize f = w^T C w - λ(w^T w - 1)
- differentiating with respect to w gives 2Cw - 2λw = 0
- eigenvalue equation: Cw = λw, where C = X^T X
- at the optimum the variance is w^T C w = λ w^T w = λ, so we pick the eigenvector with the largest eigenvalue
- and the eigenvalues of C are the squared singular values of X
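
To make the last two points concrete, a small illustrative numpy check (not from the slides; the data is synthetic): the eigenvalues of C = X^T X coincide with the squared singular values of X, and the variance along the top eigenvector equals the largest eigenvalue.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)                  # zero-center the data

C = X.T @ X                             # (scaled) covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)    # ascending eigenvalues, orthonormal eigenvectors
sing = np.linalg.svd(X, compute_uv=False)   # singular values of X, descending

print(np.allclose(np.sort(eigvals)[::-1], sing ** 2))   # True: eigenvalues of C = squared singular values

w = eigvecs[:, -1]                      # eigenvector with the largest eigenvalue
print(np.isclose((X @ w) @ (X @ w), eigvals[-1]))       # variance along w equals lambda_max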

recall: singular value decomposition (SVD)
- every n x d matrix X can be decomposed in the form X = U Σ V^T, where
  - U is an orthogonal matrix containing the left singular vectors of X
  - V is an orthogonal matrix containing the right singular vectors of X
  - Σ is a diagonal matrix containing the singular values of X (σ_1 ≥ σ_2 ≥ ...)
- extremely useful tool for analyzing data

singular value decomposition
- X = U Σ V^T
  (figure: block diagram of the objects x dimensions matrix split into a significant part and a noise part)
- X_k = U_k Σ_k V_k^T is the best rank-k approximation of X
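
A minimal numpy sketch (illustrative only, synthetic data) of the truncated SVD and the rank-k approximation; the final check verifies that the squared Frobenius error of X_k equals the sum of the discarded squared singular values.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 20))

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) Vt

k = 5
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # best rank-k approximation of X

err = np.linalg.norm(X - X_k, 'fro') ** 2          # squared Frobenius error
print(np.isclose(err, np.sum(s[k:] ** 2)))         # True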

principal component analysis
- we showed that the principal components are given by the singular vectors of X
- in particular: the i-th principal component of X is the i-th right singular vector of X
- the variance along the i-th principal component is exactly the i-th singular value squared (σ_i^2)
- rule of thumb: consider k principal components so that you capture about 85% of the variance of the original data (can be estimated using the singular values)

principal component analysis
- what we saw so far: PCA is SVD on centered data
- how not to compute PCA: center the data to get X, form C = X^T X, and solve the eigenproblem of C
- why? forming C explicitly is expensive for high-dimensional data and numerically less stable than working with X directly
- how to compute PCA: center the data to get X, then do SVD on X
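
Putting the recipe together, a short illustrative sketch (assumed numpy, synthetic data; all names are made up): center the data, do SVD, take the right singular vectors as principal components, and choose k with the 85% rule of thumb from the earlier slide.

import numpy as np

rng = np.random.default_rng(3)
X_raw = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))   # toy correlated data

X = X_raw - X_raw.mean(axis=0)                     # center the data to get X
U, s, Vt = np.linalg.svd(X, full_matrices=False)   # do SVD on X

variance = s ** 2                                  # variance along the i-th principal component
explained = np.cumsum(variance) / variance.sum()
k = int(np.searchsorted(explained, 0.85)) + 1      # smallest k capturing about 85% of the variance

components = Vt[:k]                                # the first k principal components (right singular vectors)
Y = X @ components.T                               # the data expressed in the first k components

print(k, Y.shape)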

example of PCA
- PCA is used a lot for data visualization
- example: spatial data analysis
- data: 9000 dialect words, 500 counties in Finland
- word-county matrix X: X(i,j) = 1 if word i appears in county j, 0 otherwise
- apply PCA on X

example of PCA
- data points: words; variables: counties
- each principal component tells which counties explain the most significant part of the variation left in the data
- the first principal component is essentially just the number of words in each county!
- after this, the geographical structure of the principal components is apparent
- note: PCA knows nothing of the geography of the counties


applications of PCA
- data visualization and exploration
- data compression
- outlier detection
- ...

random projections vs. PCA
- different objectives:
  - random projections preserve distances
  - PCA finds the directions of maximum variance in the data
- PCA involves SVD, very inefficient for large data
- random projections can be implemented very efficiently, especially sparse variants

random projections vs. PCA
(figures from [Bingham and Mannila 2001])
Figure 1: the error produced by RP, SRP, PCA and DCT on image data, as a function of the reduced dimensionality, with 95% confidence intervals over 100 pairs of data vectors.
Figure 2: the number of Matlab floating point operations needed when reducing the dimensionality of image data using RP, SRP, PCA and DCT, on a logarithmic scale.

thanks: the slides on PCA are adapted from slides by Saara Hyvönen