Principal Components Analysis

Cheng Li, Bingyu Wang

November 3, 2014

1 What is PCA?

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. In other words, PCA tries to identify the subspace in which the data approximately lies.

Here are two examples given in Andrew Ng's notes:

1) Given a dataset $\{x^{(i)};\ i = 1, \ldots, m\}$ of attributes of $m$ different types of automobiles, such as their maximum speed, turn radius, and so on, let $x^{(i)} \in \mathbb{R}^n$ for each $i$. Unknown to us, two different features, some $x_i$ and $x_j$, respectively give a car's maximum speed measured in miles per hour and its maximum speed measured in kilometers per hour. These two features are therefore almost linearly dependent, up to only small differences introduced by rounding to the nearest mph and kph. Thus the data really lies approximately on an $(n-1)$-dimensional subspace. How can we automatically detect, and perhaps remove, this redundancy?

2) For a less contrived example, consider a dataset resulting from a survey of pilots for radio-controlled helicopters, where $x_1^{(i)}$ is a measure of the piloting skill of pilot $i$, and $x_2^{(i)}$ captures how much he/she enjoys flying. Because RC helicopters are very difficult to fly, only the most committed students, the ones that truly enjoy flying, become good pilots. So the two features $x_1$ and $x_2$ are strongly correlated. Indeed, we might posit that the data actually lies along some diagonal axis (the $u_1$ direction) capturing the intrinsic piloting "karma" of a person, with only a small amount of noise lying off this axis (see Figure 1). How can we automatically compute this $u_1$ direction?

[Figure 1: Pilot's skill and enjoyment relationship.]

2 Pre-processing Prior to PCA

Before running PCA, we typically first pre-process the data to normalize its mean and variance, as follows:

1. Let $\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$.

2. Replace each $x^{(i)}$ with $x^{(i)} - \mu$.

3. Let $\sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} \big(x_j^{(i)}\big)^2$.

4. Replace each $x_j^{(i)}$ with $x_j^{(i)} / \sigma_j$.

Steps 1-2 zero out the mean of the data, and may be omitted for data known to have zero mean (for instance, time series corresponding to speech or other acoustic signals). Steps 3-4 rescale each coordinate to have unit variance, which ensures that different attributes are all treated on the same scale. For instance, if $x_1$ was a car's maximum speed in mph (taking values in $\{10, \ldots, 200\}$) and $x_2$ was the number of seats (taking values in $\{2, \ldots, 8\}$), this renormalization rescales the different features to make them more comparable. Steps 3-4 may be omitted if we have a priori knowledge that the different features are already on the same scale (for instance, the MNIST digits dataset).
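These four pre-processing steps translate directly into a few lines of NumPy. The sketch below is our own illustration (the original notes contain no code): the function name `normalize`, the data matrix `X` with one example $x^{(i)}$ per row, and the small guard `eps` against zero-variance features are all our choices.

    import numpy as np

    def normalize(X, eps=1e-12):
        """Steps 1-4: zero the mean of each feature, then rescale to unit variance.

        Illustrative sketch (not from the notes). X is an (m, n) array with one
        example x^(i) per row; the result has mean ~0 and variance ~1 per column.
        """
        X = np.asarray(X, dtype=float)
        mu = X.mean(axis=0)                              # step 1: feature means
        X_centered = X - mu                              # step 2: subtract the mean
        sigma = np.sqrt((X_centered ** 2).mean(axis=0))  # step 3: per-feature scale
        return X_centered / (sigma + eps)                # step 4: rescale

    # Example: maximum speed (mph) and number of seats end up on the same scale.
    X = np.array([[120.0, 2], [150.0, 4], [90.0, 5], [180.0, 2]])
    X_norm = normalize(X)
    print(X_norm.mean(axis=0))   # approximately [0, 0]
    print(X_norm.std(axis=0))    # approximately [1, 1]

For data already known to be on a common scale, the last two lines of `normalize` (steps 3-4) would simply be skipped.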

3 PCA Theory

After the prior Steps 1-4 described above, all we have to do is an eigenvector calculation: we choose the top $k$ eigenvectors as the new subspace. Why does this work, and what is the theory behind PCA? In fact, many theories can explain PCA; here we choose one of them, called the Maximum Variance theory.

3.1 Maximum Variance Theory

Having carried out the normalization, how do we compute the "major axis of variation" $u$, that is, the direction on which the data approximately lies? One way to pose this problem is as finding the unit vector $u$ so that when the data is projected onto the direction corresponding to $u$, the variance of the projected data is maximized. Intuitively, the data starts off with some amount of variance/information in it. We would like to choose a direction $u$ so that if we were to approximate the data as lying in the direction/subspace corresponding to $u$, as much as possible of this variance is still retained.

3.2 Analysis

Consider the dataset shown in Figure 2, on which we have already carried out the normalization steps.

[Figure 2: Sample data in two dimensions.]

Now suppose we pick $u$ to correspond to the direction shown in Figure 3. The circles denote the projections of the original data onto this line. We see that in Figure 3 the projected data still has a fairly large variance, and the points tend to be far from zero. In contrast, suppose we had instead picked the direction shown in Figure 4: there the projections have a significantly smaller variance and are much closer to the origin.

We would like to automatically select the direction $u$ corresponding to Figure 3, the one with maximum variance. First, we need to know how to calculate the distance of a point's projection onto $u$ from the origin: given a unit vector $u$ and a point $x$, the length of the projection of $x$ onto $u$ is $x^T u$. So the problem can be restated mathematically as choosing $u$ to solve

$$\max_{u:\, \|u\| = 1} \; \frac{1}{m} \sum_{i=1}^{m} \big(x^{(i)T} u\big)^2.$$
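To make the objective concrete, here is a small sketch of our own (not from the notes; the toy data and the helper `projected_variance` are ours) that evaluates $\frac{1}{m}\sum_{i=1}^{m}(x^{(i)T}u)^2$ for two candidate unit directions, mirroring the contrast between Figures 3 and 4: a direction aligned with the data retains far more variance than one roughly orthogonal to it.

    import numpy as np

    def projected_variance(X, u):
        """Return (1/m) * sum_i (x^(i)T u)^2 for the unit vector in direction u."""
        u = u / np.linalg.norm(u)          # enforce the constraint ||u|| = 1
        return np.mean((X @ u) ** 2)

    # Toy zero-mean 2-D data lying roughly along the diagonal (illustrative only).
    rng = np.random.default_rng(0)
    z = rng.normal(size=(500, 1))
    X = np.hstack([z, z + 0.1 * rng.normal(size=(500, 1))])
    X = X - X.mean(axis=0)

    u_major = np.array([1.0, 1.0])         # roughly the major axis (cf. Figure 3)
    u_minor = np.array([1.0, -1.0])        # roughly orthogonal to it (cf. Figure 4)
    print(projected_variance(X, u_major))  # large
    print(projected_variance(X, u_minor))  # small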

[Figure 3: Sample data projected onto the direction of maximum variance.]

[Figure 4: Sample data projected onto a direction of minimum variance.]

Next, we derive the solution to this problem as follows:

$$
\begin{aligned}
\max_{u:\,\|u\|=1} \; \frac{1}{m}\sum_{i=1}^{m} \big(x^{(i)T}u\big)^2
&= \max_{u:\,\|u\|=1} \; \frac{1}{m}\sum_{i=1}^{m} \big(x^{(i)T}u\big)^T \big(x^{(i)T}u\big) \\
&= \max_{u:\,\|u\|=1} \; \frac{1}{m}\sum_{i=1}^{m} \big(u^T x^{(i)}\big)\big(x^{(i)T}u\big) \\
&= \max_{u:\,\|u\|=1} \; \frac{1}{m}\sum_{i=1}^{m} u^T x^{(i)} x^{(i)T} u \\
&= \max_{u:\,\|u\|=1} \; u^T \Big(\frac{1}{m}\sum_{i=1}^{m} x^{(i)} x^{(i)T}\Big) u \\
&= \max_{u:\,\|u\|=1} \; u^T \Sigma u, \qquad \text{where } \Sigma = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} x^{(i)T}.
\end{aligned}
$$

Now we can use a Lagrange multiplier to continue:

$$\max_{u} \; u^T \Sigma u \quad \text{subject to } u^T u = 1 \;(\text{because } \|u\| = 1),$$

$$L(u, \lambda) = u^T \Sigma u + \lambda\big(u^T u - 1\big),$$

$$\frac{\partial L(u, \lambda)}{\partial u} = \frac{\partial}{\partial u}\big(u^T \Sigma u + \lambda u^T u\big) = \frac{\partial}{\partial u}\big(u^T \Sigma u\big) + \frac{\partial}{\partial u}\big(\lambda u^T u\big).$$

Since $u \in \mathbb{R}^n$ is a column vector, matrix calculus (see http://en.wikipedia.org/wiki/Matrix_calculus) gives

$$\frac{\partial\, x^T A x}{\partial x} = 2Ax, \qquad \frac{\partial\, x^T x}{\partial x} = 2x,$$

where $A$ is not a function of $x$ and $A$ is symmetric. Then we get

$$\frac{\partial}{\partial u}\big(u^T \Sigma u\big) + \frac{\partial}{\partial u}\big(\lambda u^T u\big) = 2\Sigma u + 2\lambda u.$$

Setting this derivative to $0$, and noting that $\lambda$ is a scalar so we may replace $-\lambda$ by $\lambda$, we finally get

$$\Sigma u = \lambda u,$$

where $u$ is an eigenvector of $\Sigma$ and $\lambda$ is the corresponding eigenvalue.
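In code, the conclusion $\Sigma u = \lambda u$ says that the optimal one-dimensional direction is the eigenvector of $\Sigma$ with the largest eigenvalue. The following check is our own sketch on toy data (not from the notes); `numpy.linalg.eigh` is used because $\Sigma$ is symmetric, and the random unit vector `v` illustrates that no other direction retains more projected variance.

    import numpy as np

    # Toy zero-mean data with m examples and n = 3 features (illustrative only).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.3, 0.1],
                                              [0.3, 1.0, 0.2],
                                              [0.1, 0.2, 0.5]])
    X = X - X.mean(axis=0)
    m = X.shape[0]

    Sigma = (X.T @ X) / m                        # Sigma = (1/m) sum_i x^(i) x^(i)T
    eigvals, eigvecs = np.linalg.eigh(Sigma)     # eigenvalues in ascending order
    u = eigvecs[:, -1]                           # principal eigenvector
    lam = eigvals[-1]                            # largest eigenvalue

    print(np.allclose(Sigma @ u, lam * u))       # Sigma u = lambda u
    print(np.isclose(u @ Sigma @ u, lam))        # projected variance attained = lambda
    v = rng.normal(size=3)
    v = v / np.linalg.norm(v)                    # an arbitrary competing unit direction
    print(v @ Sigma @ v <= lam + 1e-12)          # True: it retains no more variance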

To summarize, we have found that if we wish to find a 1-dimensional subspace with which to approximate the data, we should choose $u$ to be the principal eigenvector of $\Sigma$. More generally, if we wish to project our data onto a $k$-dimensional subspace ($k < n$), we should choose $u_1, u_2, \ldots, u_k$ to be the top $k$ eigenvectors of $\Sigma$. The $u_i$'s now form a new, orthogonal basis for the data (because $\Sigma$ is symmetric, the $u_i$'s will be, or can always be chosen to be, orthogonal to each other).

Then, to represent $x^{(i)}$ in this basis, we need only compute the corresponding vector

$$\hat{x}^{(i)} = \begin{bmatrix} u_1^T x^{(i)} \\ u_2^T x^{(i)} \\ \vdots \\ u_k^T x^{(i)} \end{bmatrix} \in \mathbb{R}^k.$$

Thus, whereas $x^{(i)} \in \mathbb{R}^n$, the vector $\hat{x}^{(i)}$ gives a lower, $k$-dimensional approximation/representation of $x^{(i)}$. That is to say, $x^{(i)}$ is the original datapoint and $\hat{x}^{(i)}$ is the new datapoint after PCA. PCA is therefore also referred to as a dimensionality reduction algorithm. The vectors $u_1, \ldots, u_k$ are called the first $k$ principal components of the data.

4 References

PCA Lecture Notes by Andrew Ng (Stanford University).
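To close, here is a compact end-to-end sketch of our own devising that ties together the steps in these notes: pre-processing, forming $\Sigma$, taking the top $k$ eigenvectors, and computing $\hat{x}^{(i)} = (u_1^T x^{(i)}, \ldots, u_k^T x^{(i)})^T$. The function name `pca`, the parameter `k`, and the toy data are our choices, not part of the original notes.

    import numpy as np

    def pca(X, k):
        """Project the rows of X onto the top-k principal components.

        Illustrative sketch (not from the notes). Returns (X_hat, U_k): X_hat is
        (m, k) with the reduced x_hat^(i) as rows, and U_k is (n, k) whose columns
        are the top-k eigenvectors of Sigma.
        """
        X = np.asarray(X, dtype=float)
        m = X.shape[0]
        X = X - X.mean(axis=0)                    # pre-processing: zero mean
        X = X / (X.std(axis=0) + 1e-12)           # pre-processing: unit variance
        Sigma = (X.T @ X) / m                     # empirical covariance matrix
        eigvals, eigvecs = np.linalg.eigh(Sigma)  # ascending eigenvalues
        U_k = eigvecs[:, ::-1][:, :k]             # top-k eigenvectors as columns
        return X @ U_k, U_k                       # row i of X @ U_k is U_k^T x^(i)

    # Usage: reduce 5-dimensional data to its first k = 2 principal components.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 5))
    X_hat, U_k = pca(X, k=2)
    print(X_hat.shape)    # (200, 2)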