Principal Components Analysis and Unsupervised Hebbian Learning

Robert Jacobs
Department of Brain & Cognitive Sciences
University of Rochester
Rochester, NY 14627, USA

August 8, 2008

Reference: Much of the material in this note is from Haykin, S. (1994). Neural Networks. New York: Macmillan.

Here we consider training a single-layer neural network (no hidden units) with an unsupervised Hebbian learning rule. It seems sensible that we might want the activation of an output unit to vary as much as possible when given different inputs. After all, if this activation was a constant for all inputs then the activation would not be telling us anything about which input is present. In contrast, if the activation has very different values for different inputs then there is some hope of knowing something about the input based just on the output unit's activation. So we might want to train the output unit to adapt its weights so as to maximize the variance of its activation. However, one way in which the unit might do this is by making its weights arbitrarily large. This would not be a satisfying solution. We would prefer a solution in which the variance of the output activation is maximized subject to the constraint that the weights are bounded. For example, we might want to constrain the length of the weight vector to be one.

Lagrange optimization is a method for maximizing (or minimizing) a function subject to one or more constraints. At this point, we take a brief detour and give an example of Lagrange optimization. From this example, you should get the general idea of how this method works.

Example: Suppose that we want to find the shortest vector y = [y_1 y_2]^T that lies on the line 2y_1 + y_2 − 5 = 0. We write down an objective function L in which the first term states the function that we want to minimize, and the second term states the constraint:

L = (1/2)(y_1^2 + y_2^2) + λ(2y_1 + y_2 − 5)    (1)

where λ is called a Lagrange multiplier (it is always placed in front of the constraint). The first term gives the length of the vector (actually it's 1/2 times the length of the vector squared); the second term gives the constraint. We now take derivatives, and set these derivatives equal to zero:

∂L/∂y_1 = y_1 + 2λ = 0    (2)
∂L/∂y_2 = y_2 + λ = 0    (3)
∂L/∂λ = 2y_1 + y_2 − 5 = 0.    (4)

Note that we have three equations and three unknown variables. From the first derivative we know that y_1 = −2λ, and from the second derivative we know that y_2 = −λ. If we plug these values into the third derivative, we get −4λ − λ = 5, or that λ = −1. From this, we can now solve for y_1 and y_2; in particular, y_1 = 2 and y_2 = 1. So the solution to our problem is y = [2 1]^T.
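
As a quick numerical check (my own addition, not part of the original handout), the three conditions (2)-(4) form a linear system in y_1, y_2, and λ that can be solved directly; the NumPy sketch below recovers the solution given above and confirms that it is the shortest vector on the line.

    import numpy as np

    # Stationarity conditions (2)-(4) as a linear system in (y1, y2, lambda):
    #   y1 + 2*lam = 0,   y2 + lam = 0,   2*y1 + y2 = 5
    A = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 1.0],
                  [2.0, 1.0, 0.0]])
    b = np.array([0.0, 0.0, 5.0])
    y1, y2, lam = np.linalg.solve(A, b)
    print(y1, y2, lam)                            # 2.0 1.0 -1.0, matching the text

    # The solution really is the shortest vector on the line 2*y1 + y2 = 5:
    for t in [-1.0, 0.0, 0.5, 3.0]:
        other = np.array([t, 5.0 - 2.0 * t])      # an arbitrary point on the line
        assert np.linalg.norm([y1, y2]) <= np.linalg.norm(other)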

Now let's turn to the problem that we really care about. We want to maximize the variance of the output unit's activation subject to the constraint that the length of the weight vector is equal to one. Without loss of generality, let's make things simpler by assuming that the input variables each have a mean of zero (E[x_i] = 0). Because the mapping from inputs to the output is linear (y = w^T x = x^T w), this means that the output activation will also have a mean of zero (E[y] = 0). We are now ready to write down the objective function:

L = (1/2) Σ_p y_p^2 + λ(1 − w^T w)    (5)

where y_p is the output activation on data pattern p. The first term, (1/2) Σ_p y_p^2, is proportional to the sample variance of the output unit, and the second term, λ(1 − w^T w), gives the constraint that the weight vector should have a length of one. We now do some algebraic manipulations:

L = (1/2) Σ_p (w^T x_p)(x_p^T w) + λ(1 − w^T w)    (6)
  = (1/2) w^T (Σ_p x_p x_p^T) w + λ(1 − w^T w)    (7)
  = (1/2) w^T Q w + λ(1 − w^T w)    (8)

where Q = Σ_p x_p x_p^T is (up to a constant factor) the sample covariance matrix of the zero-mean inputs. Now let's take the derivative of L with respect to w, and set this derivative equal to zero:

∂L/∂w = Qw − 2λw = 0    (9)

which means that Qw = 2λw. In other words, w is an eigenvector of the covariance matrix Q (the corresponding eigenvalue is 2λ). So if we want to maximize the variance of the output activation (subject to the constraint that the weight vector is of length one), then we should set the weight vector to be an eigenvector of the input covariance matrix (in fact it should be the eigenvector with the largest eigenvalue). As discussed below, this has an interesting relationship to principal component analysis and unsupervised Hebbian learning.

The idea behind principal component analysis (PCA) is that we want to represent the data with respect to a new basis, where the vectors that form this new basis are the eigenvectors of the data covariance matrix. Recall that eigenvectors with large eigenvalues give directions of large variance (again, we are considering the eigenvectors of a covariance matrix), whereas eigenvectors with small eigenvalues give directions of small variance (so the eigenvector with the largest eigenvalue gives the direction in which the input data shows the greatest variance, and the eigenvector with the smallest eigenvalue gives the direction in which the input data shows the least variance). Each eigenvalue gives the value of the variance in the direction of its corresponding eigenvector (thus the eigenvalues are all non-negative).
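
To make the eigenvector result concrete, here is a minimal NumPy sketch (my own illustration on made-up zero-mean data, not from the handout). It builds Q as in equation (8), takes the eigenvector with the largest eigenvalue, and checks that no randomly chosen unit-length weight vector yields a larger mean squared output.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal(mean=[0.0, 0.0, 0.0],
                                cov=[[3.0, 1.0, 0.0],
                                     [1.0, 2.0, 0.0],
                                     [0.0, 0.0, 0.5]],
                                size=5000)            # rows are zero-mean patterns x_p

    Q = X.T @ X                                        # sum_p x_p x_p^T, as in equation (8)
    eigvals, eigvecs = np.linalg.eigh(Q)               # eigh because Q is symmetric
    w_best = eigvecs[:, np.argmax(eigvals)]            # unit length, largest eigenvalue

    best = np.mean((X @ w_best) ** 2)                  # sample E[y^2] for this weight vector
    for _ in range(100):
        w = rng.standard_normal(3)
        w /= np.linalg.norm(w)                         # any other unit-length weight vector
        assert np.mean((X @ w) ** 2) <= best + 1e-9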

Because the data covariance matrix is symmetric, its eigenvectors are orthogonal (as usual, we assume that the eigenvectors are of length one). Let x^(t) = [x_1^(t), ..., x_n^(t)]^T be a data item, and e_i = [e_i1, ..., e_in]^T be the i-th eigenvector (assume that the eigenvectors are ordered such that the eigenvector with the largest eigenvalue is e_1, the eigenvector with the next largest eigenvalue is e_2, and so on). The projection of x^(t) onto e_i is denoted y_i^(t), and is given by the inner product of these two vectors:

y_i^(t) = [x^(t)]^T e_i = e_i^T x^(t).    (10)

That is, y_i^(t) is the coordinate of x^(t) along the axis given by e_i, and y^(t) = [y_1^(t), ..., y_n^(t)]^T is the representation of x^(t) with respect to the new basis. Let E = [e_1, ..., e_n] be the matrix whose i-th column is the i-th eigenvector. Then we can move back and forth between the two representations of the same data by the linear equations:

y^(t) = E^T x^(t),    (11)

and

x^(t) = E y^(t).    (12)

Please make sure that you understand these equations [we have used the fact that for an orthogonal matrix E (a matrix whose columns are orthogonal unit vectors), its inverse, E^(−1), is equal to its transpose, E^T].

Principal component analysis is useful for at least two reasons. First, as a feature detection mechanism: it is often the case that the eigenvectors give interesting underlying features of the data. PCA is also useful as a dimensionality reduction technique. Suppose that we represent the data using only the first m eigenvectors, where m < n (that is, suppose we represent x^(t) with ŷ^(t) = [y_1^(t), ..., y_m^(t)]^T instead of y^(t) = [y_1^(t), ..., y_n^(t)]^T). Out of all linear transformations that produce an m-dimensional representation, this transformation is optimal in the sense that it preserves the most information about the data (yes, I am being vague here). That is, PCA can be used for optimal linear dimensionality reduction.
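
Below is a short NumPy sketch of the change of basis in equations (11)-(12) and of the m-dimensional reduction just described; the synthetic data and the choice m = 2 are illustrative assumptions of mine, not part of the handout. It also illustrates the earlier remark that each eigenvalue is the variance along its eigenvector: the mean squared reconstruction error per pattern equals the sum of the discarded eigenvalues.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.standard_normal((2000, 4)) @ np.diag([3.0, 2.0, 0.5, 0.1])   # zero-mean data, n = 4
    C = (X.T @ X) / len(X)                           # data covariance matrix

    eigvals, E = np.linalg.eigh(C)                   # columns of E are orthonormal eigenvectors
    order = np.argsort(eigvals)[::-1]                # reorder so e_1 has the largest eigenvalue
    eigvals, E = eigvals[order], E[:, order]

    Y = X @ E                                        # y^(t) = E^T x^(t), one pattern per row
    assert np.allclose(Y @ E.T, X)                   # x^(t) = E y^(t): exact reconstruction

    m = 2                                            # keep only the first m coordinates
    X_hat = Y[:, :m] @ E[:, :m].T                    # reconstruction from y_1 ... y_m only
    print(np.sum((X - X_hat) ** 2) / len(X))         # mean squared reconstruction error
    print(eigvals[m:].sum())                         # equals the sum of discarded eigenvalues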

In the remainder of this handout, we prove that the use of a particular Hebbian learning rule results in a network that performs PCA. That is, the weight vector of the i-th output unit goes to the i-th eigenvector e_i of the data covariance matrix, and the output of this unit, y_i^(t), is the coordinate of the input x^(t) along the axis given by e_i. For simplicity, we assume that we are dealing with a two-layer linear network with n input units and only one output unit (the result extends to the case with n output units in which the weight vector of each output unit is a different eigenvector, but we will not prove that here). We will also assume that the input is a random variable with mean equal to zero. The learning rule is:

w_i(t + 1) = w_i(t) + ε[y(t) x_i(t) − y^2(t) w_i(t)]    (13)

where w_i(t) is the value of the output unit's i-th weight at time t. Note that the first term inside the brackets is the standard Hebbian learning term, and the second term is a type of weight decay term that prevents the weight from getting too big. In vector notation, this equation may be re-written:

w(t + 1) = w(t) + ε[y(t) x(t) − y^2(t) w(t)].    (14)

Recall that we are using a linear network so that

y(t) = w^T(t) x(t) = x^T(t) w(t).    (15)

Using this fact, we re-write the weight update rule as

w(t + 1) = w(t) + ε[x(t) x^T(t) w(t) − w^T(t) x(t) x^T(t) w(t) w(t)].    (16)

Let C = E[x(t) x^T(t)] be the data covariance matrix (we use the fact that the input has a mean equal to zero). The update rule converges when E[w(t + 1)] = E[w(t)]. Denote the expected value of w(t) at convergence as w_o. Convergence occurs when the expected value of what is inside the brackets in the update rule is equal to zero:

0 = E[x(t) x^T(t)] w_o − w_o^T E[x(t) x^T(t)] w_o w_o    (17)
  = C w_o − (w_o^T C w_o) w_o.    (18)

In other words,

C w_o = λ_o w_o    (19)

where

λ_o = w_o^T C w_o.    (20)

That is, w_o is an eigenvector of the data covariance matrix C with λ_o as its eigenvalue. We know that w_o has length one because

λ_o = w_o^T C w_o    (21)
    = w_o^T λ_o w_o    (22)
    = λ_o ||w_o||^2    (23)

which is only possible if ||w_o|| = 1. We also know that λ_o is the variance in the direction of w_o because

λ_o = w_o^T C w_o    (24)
    = w_o^T E[x(t) x^T(t)] w_o    (25)
    = E[{w_o^T x(t)}{x^T(t) w_o}]    (26)
    = E[y^2(t)]    (27)

which is the variance of the output when the weight vector equals w_o (assuming that the output has mean zero, which is true given our assumption that the input has mean zero and that the input and output are linearly related).
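
The following NumPy sketch (an illustration of mine; the covariance matrix, learning rate, and number of updates are arbitrary choices, not values from the handout) simulates the stochastic update rule (14) on zero-mean two-dimensional inputs. After training, the weight vector has length close to one and is aligned, up to sign, with the eigenvector of C that has the largest eigenvalue, in agreement with the analysis above.

    import numpy as np

    rng = np.random.default_rng(2)
    C_true = np.array([[3.0, 1.0],
                       [1.0, 1.0]])                        # covariance of the zero-mean inputs
    eps = 0.005                                            # learning rate

    w = rng.standard_normal(2)
    w /= np.linalg.norm(w)                                 # random unit-length initial weights

    for _ in range(20000):
        x = rng.multivariate_normal([0.0, 0.0], C_true)    # input x(t), mean zero
        y = w @ x                                          # y(t) = w^T(t) x(t), equation (15)
        w = w + eps * (y * x - (y ** 2) * w)               # equation (14)

    eigvals, eigvecs = np.linalg.eigh(C_true)
    e1 = eigvecs[:, np.argmax(eigvals)]                    # eigenvector of C, largest eigenvalue

    print(np.linalg.norm(w))                               # close to 1, as in equations (21)-(23)
    print(abs(w @ e1))                                     # close to 1: w has converged to ±e1

A smaller learning rate reduces the fluctuations of w around the eigenvector, at the cost of slower convergence.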

We still need to prove that w_o is the particular eigenvector with the largest eigenvalue, i.e. that w_o = e_1. This is trickier to prove, so we shall omit it. This result is due to Oja (1982). The extension to networks with n output units in which the weight vector of each output unit goes to a different eigenvector is due to Sanger (1989). The extension to networks (using Hebbian and anti-Hebbian learning) with n output units in which the n weight vectors span the same space as the first n eigenvectors is due to Földiák (1990).

Földiák, P. (1990). Adaptive network for optimal linear feature extraction. Proceedings of the IEEE/INNS International Joint Conference on Neural Networks.

Oja, E. (1982). A simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15, 267-273.

Sanger, T.D. (1989). Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2, 459-473.