Approximating the Best Linear Unbiased Estimator of Non-Gaussian Signals with Gaussian Noise


IEICE Transactions on Information and Systems, vol. E91-D, no. 5, pp. 1577-1580, 2008.

Masashi Sugiyama (sugi@cs.titech.ac.jp)
Department of Computer Science, Tokyo Institute of Technology
2-12-1 O-okayama, Meguro-ku, Tokyo 152-8552, Japan

Motoaki Kawanabe (motoaki.kawanabe@first.fraunhofer.de)
Fraunhofer FIRST.IDA, Kekuléstrasse 7, 12489 Berlin, Germany

Gilles Blanchard (gilles.blanchard@first.fhg.de)
Fraunhofer FIRST.IDA, Kekuléstrasse 7, 12489 Berlin, Germany

Klaus-Robert Müller (klaus@first.fraunhofer.de)
Department of Computer Science, Technical University of Berlin
Franklinstrasse 28/29, 10587 Berlin, Germany

Abstract

Obtaining the best linear unbiased estimator (BLUE) of noisy signals is a traditional but powerful approach to noise reduction. Explicitly computing the BLUE usually requires prior knowledge of the noise covariance matrix and of the subspace to which the true signal belongs. However, such prior knowledge is often unavailable in practice, which prevents us from applying the BLUE to real-world problems. To cope with this problem, we give a practical procedure for approximating the BLUE without such prior knowledge. Our additional assumption is that the true signal follows a non-Gaussian distribution while the noise is Gaussian.

Keywords: signal denoising, best linear unbiased estimator (BLUE), non-Gaussian component analysis (NGCA), Gaussian noise.

1 Introduction and formulation

Let x ∈ R^d be the observed noisy signal, which is composed of an unknown true signal s and unknown noise n:

x = s + n.  (1)

We treat s and n as random variables (and thus x as well), and assume that s and n are statistically independent. We further suppose that the true signal s lies within a subspace S ⊂ R^d of known dimension m = dim(S), where 1 ≤ m < d. The noise n, on the other hand, spreads out over the entire space R^d and is assumed to have mean zero. Following this generative model, we are given i.i.d. observations {x_i}_{i=1}^n. Our goal is to obtain denoised signals that are close to the true signals {s_i}_{i=1}^n.

A standard approach to noise reduction in this setting is to project the noisy signal x onto the true signal subspace S, by which the noise is reduced while the signal component s is preserved. The projection does not have to be orthogonal, so we may optimize the projection direction so that the maximum amount of noise is removed. In statistics, the linear estimator which fulfills this requirement is called the best linear unbiased estimator (BLUE) [1]. The BLUE has the minimum variance among all linear unbiased estimators. More precisely, the BLUE of s, denoted by ŝ, is defined by

ŝ = Hx,  (2)

where

H ∈ argmin_{H ∈ R^{d×d}} E_n ‖Hx − E_n[Hx]‖^2  subject to  E_n[Hx] = s.  (3)

In the above, E_n denotes the expectation over the noise n. Let Q be the noise covariance matrix:

Q ≡ E_n[nn^⊤],  (4)

which we assume to be non-degenerate. Let P be the orthogonal projection matrix onto the subspace S. Then the estimation matrix H defined by Eq.(3) is given by (see e.g., [1, 4])

H = (P Q^{-1} P)^† P Q^{-1},  (5)

where † denotes the Moore-Penrose generalized inverse. Let N ≡ Q S^⊥, where S^⊥ is the orthogonal complement of S. Then it is possible to show that H is an oblique projection onto S along N (see e.g., [4]):

Hx = x  if x ∈ S,
Hx = 0  if x ∈ N.  (6)

This is illustrated in Figure 1.
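To make Eqs.(2), (5), and (6) concrete, the following minimal Python/NumPy sketch (not part of the original letter; the matrices are randomly generated purely for illustration) builds the BLUE estimation matrix for a known signal subspace S and known noise covariance Q, and checks the oblique-projection property (6):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 2

# Known signal subspace S = range(B) and its orthogonal projector P.
B = rng.standard_normal((d, m))
P = B @ np.linalg.pinv(B)

# Known, non-degenerate noise covariance matrix Q.
A = rng.standard_normal((d, d))
Q = A @ A.T + np.eye(d)

# Estimation matrix of the BLUE, Eq.(5): H = (P Q^{-1} P)^† P Q^{-1}.
Qinv = np.linalg.inv(Q)
H = np.linalg.pinv(P @ Qinv @ P) @ P @ Qinv

# Characterization (6): H is the oblique projector onto S along N = Q S^perp.
x_S = B @ rng.standard_normal(m)                      # a vector in S
x_N = Q @ ((np.eye(d) - P) @ rng.standard_normal(d))  # a vector in N
print(np.allclose(H @ x_S, x_S))  # True: signal components are preserved
print(np.allclose(H @ x_N, 0))    # True: noise along N is removed
```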

Figure 1: Illustration of the BLUE.

When computing the BLUE by Eqs.(2) and (5), the noise covariance matrix Q and the signal subspace S must be known. However, Q and S are often unknown in practice, and estimating them from the data samples {x_i}_{i=1}^n is not generally a straightforward task. For this reason, the applicability of the BLUE to real-world problems has been rather limited so far.

In this paper, we therefore propose a new method which enables us to obtain an approximation of the BLUE even in the absence of prior knowledge of Q and S. Our procedure is based on two facts: the noise covariance matrix Q is actually not required for computing the BLUE (see Section 2), and the signal subspace S can be estimated by a recently proposed dimensionality reduction method called non-Gaussian component analysis [2, 3] (see Section 3). We effectively combine these two ideas and propose a practical procedure for approximating the BLUE. Our additional assumption is that the true signal follows a non-Gaussian distribution while the noise is Gaussian.

2 Computing the BLUE without knowledge of the noise covariance matrix Q

In this section, we show that the noise covariance matrix Q is not needed for computing the BLUE. Let Σ ≡ E_x[xx^⊤]. Then the following lemma holds.

Lemma 1  The subspace N is expressed as N = Σ S^⊥.

Since s and n are independent and E_n[n] = 0, the above lemma can be immediately confirmed as

Σ S^⊥ = E_s[ss^⊤] S^⊥ + E_x[sn^⊤ + ns^⊤] S^⊥ + E_n[nn^⊤] S^⊥ = Q S^⊥,  (7)

where the first term vanishes because s lies in S, and the cross term vanishes by independence and E_n[n] = 0.

Lemma 1 implies that the noise reduction subspace N can be characterized by Σ, without using the noise covariance matrix Q. Then the following lemma holds.

Lemma 2  The estimation matrix H is expressed as

H = (P Σ^{-1} P)^† P Σ^{-1}.
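Before turning to the proof, here is an informal numerical check of Lemmas 1 and 2 (a minimal Python/NumPy sketch with a randomly generated subspace, noise covariance, and signal second-moment matrix; not part of the original letter):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 5, 2

B = rng.standard_normal((d, m))
P = B @ np.linalg.pinv(B)          # orthogonal projector onto S
P_perp = np.eye(d) - P             # orthogonal projector onto S^perp

A = rng.standard_normal((d, d))
Q = A @ A.T + np.eye(d)            # non-degenerate noise covariance
M = rng.standard_normal((d, d))
C_s = P @ (M @ M.T) @ P            # a signal second moment E[ss^T] supported on S
Sigma = C_s + Q                    # Sigma = E[xx^T] = E[ss^T] + Q

# Lemma 1 / Eq.(7): Sigma acts on S^perp exactly as Q does, so Sigma S^perp = Q S^perp.
V = P_perp @ rng.standard_normal((d, d - m))   # vectors spanning S^perp
print(np.allclose(Sigma @ V, Q @ V))           # True

# Lemma 2: the estimation matrix built from Sigma equals the one built from Q.
def estimation_matrix(P, C):
    Cinv = np.linalg.inv(C)
    return np.linalg.pinv(P @ Cinv @ P) @ P @ Cinv

print(np.allclose(estimation_matrix(P, Q), estimation_matrix(P, Sigma)))  # True
```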

Letting P^⊥ be the orthogonal projection matrix onto S^⊥, we have, for any x ∈ R^d,

(P Σ^{-1} P)^† P Σ^{-1} (P x) = (P Σ^{-1} P)^† (P Σ^{-1} P) x = P x,  (8)
(P Σ^{-1} P)^† P Σ^{-1} (Σ P^⊥ x) = (P Σ^{-1} P)^† (P P^⊥) x = 0,  (9)

where Eq.(8) uses the fact that (P Σ^{-1} P)^† (P Σ^{-1} P) is the orthogonal projector onto R(P Σ^{-1} P) = S, i.e., equals P, and Eq.(9) uses P P^⊥ = O. Comparing Eqs.(8) and (9) with the characterization (6) confirms the above lemma.

Lemma 2 implies that we can obtain the BLUE using Σ, without using Q. Roughly speaking, the N-part of Q should agree with that of Σ because the signal s lies only in S. Therefore, it intuitively seems that we can replace Q in Eq.(5) by Σ, because H only affects the component in N (see Eq.(6)). The above lemma theoretically supports this intuitive claim.

A practical advantage of the above lemma is that, while estimating the noise covariance matrix Q from the data samples {x_i}_{i=1}^n is not a straightforward task in general, Σ can be directly estimated in a consistent way as

Σ̂ = (1/n) ∑_{i=1}^n x_i x_i^⊤.  (10)

3 Estimating the signal subspace S by non-Gaussian component analysis

The remaining issue to be discussed is how to estimate the true signal subspace S (or the orthogonal projection matrix P). Here we additionally assume that the signal s follows a non-Gaussian distribution, while the noise n follows a Gaussian distribution. The papers [2, 3] proposed a framework called non-Gaussian component analysis (NGCA). We first briefly review the key idea of NGCA and then show how we employ NGCA for finding the signal subspace S.

3.1 Non-Gaussian component analysis

Suppose we have i.i.d. observations {x′_i}_{i=1}^n, which follow a d-dimensional distribution with the following semiparametric probability density function:

p(x′) = f(T x′) ϕ_Q(x′),  (11)

where f is a function from R^m to R, T is an m × d matrix, and ϕ_Q is the centered Gaussian density with covariance matrix Q. Note that we know neither p, f, T, nor Q; we only know that the density p(x′) is of the form of Eq.(11). Let S′ ≡ R(T^⊤), where R(·) denotes the range of a matrix. Then the following proposition holds.

Proposition 1 [2, 3]  Suppose E_{x′}[x′x′^⊤] = I_d, where I_d is the d-dimensional identity matrix. Then, for an arbitrary smooth function h(x′) from R^d to R,

β ≡ E_{x′}[g(x′)] ∈ S′,  (12)

where g(x′) is a function from R^d to R^d defined by

g(x′) ≡ x′ h(x′) − ∇h(x′),  (13)

and ∇ is the differential operator.

The above proposition shows that we can construct vectors which belong to S′. Generally, different functions h(x′) produce different vectors β, so we may construct a set of vectors which spans the entire subspace S′. However, β cannot be computed in practice since the unknown density p(x′) enters Eq.(12) through E_{x′}; we therefore employ its consistent estimator

β̂ = (1/n) ∑_{i=1}^n g(x′_i).  (14)

By applying principal component analysis (PCA) to the set of β̂ obtained from a set of different functions h(x′), we can estimate the subspace S′.

Eqs.(14) and (13) imply that the mapping from h to β̂ is linear. This means that we can arbitrarily change the norm of β̂ simply by multiplying h by a scalar, which can totally change the PCA results, since vectors with larger norm have a stronger impact on the PCA solution. Therefore, we need to normalize β̂ (or h) in a reasonable way. A desirable normalization scheme would have the property that the resulting β̂ has a larger norm if it is close to S′ (i.e., the angle between β̂ and S′ is small). This desirable property can be approximately achieved by normalizing β̂ by its standard deviation [2]. A consistent estimator of the variance of β̂ is given by

N̂ = (1/n) ∑_{i=1}^n ‖g(x′_i)‖^2 − ‖β̂‖^2.  (15)

Further discussions of NGCA, including a sophisticated implementation, a rigorous theoretical error analysis, and extensive experimental studies, are given in [2, 3].

3.2 Applying NGCA to estimate the signal subspace S

Here we show how NGCA is employed for approximating the BLUE. Let us apply the whitening transformation to the data samples, i.e.,

x′ = Σ^{-1/2} x.  (16)

Then the following lemma holds:

Lemma 3  The probability density function of x′ is expressed in the form of Eq.(11) with R(T^⊤) = Σ^{-1/2} S.

(Proof) Let us denote S′ ≡ Σ^{-1/2} S and N′ ≡ S′^⊥ = Σ^{1/2} S^⊥ = Σ^{-1/2} N (via Lemma 1). Decompose the whitened sample as

x′ = s′ + n′_{S′} + n′_{N′},  (17)

with s′ = Σ^{-1/2} s, and n′_{S′}, n′_{N′} the orthogonal projections of Σ^{-1/2} n onto S′ and N′, respectively. By the characterization of the whitening transform, we have E_{x′}[x′x′^⊤] = I_d. Since n′_{N′} and (s′ + n′_{S′}) are the orthogonal projections of x′ onto two orthogonal subspaces, this entails

O = E_n[n′_{N′} (s′ + n′_{S′})^⊤] = E_n[n′_{N′} n′_{S′}^⊤],  (18)

where O is the null matrix and the second equality holds because s and n are independent. This shows that n′_{S′} and n′_{N′} are independent (being uncorrelated and jointly Gaussian). Let us consider an orthonormal basis [T_{S′}, T_{N′}] such that R(T_{S′}) = S′ and R(T_{N′}) = N′, and let

z_{S′} ≡ T_{S′}^⊤ x′ = T_{S′}^⊤ s′ + T_{S′}^⊤ n′_{S′},  (19)
z_{N′} ≡ T_{N′}^⊤ x′ = T_{N′}^⊤ n′_{N′}.  (20)

Since s′ follows a non-Gaussian distribution, z_{S′} is also non-Gaussian. Let us denote the probability density function of z_{S′} by q(z_{S′}). On the other hand, z_{N′} follows a centered Gaussian distribution with identity covariance matrix I_{d−m}. Since z_{S′} and z_{N′} are independent, the probability density function of x′ is given by

p(x′) = q(z_{S′}) ϕ_{I_{d−m}}(z_{N′})
      = [q(z_{S′}) / ϕ_{I_m}(z_{S′})] ϕ_{I_m}(z_{S′}) ϕ_{I_{d−m}}(z_{N′})
      = [q(z_{S′}) / ϕ_{I_m}(z_{S′})] ϕ_{I_d}([z_{S′}^⊤, z_{N′}^⊤]^⊤)
      = [q(T_{S′}^⊤ x′) / ϕ_{I_m}(T_{S′}^⊤ x′)] ϕ_{I_d}(x′).  (21)

Putting f(z) = q(z) / ϕ_{I_m}(z), T = T_{S′}^⊤, and Q = I_d concludes the proof.

Lemma 3 shows that, by NGCA, we can estimate S′ from the whitened samples x′. Let P′ be the orthogonal projection matrix onto the subspace S′. We claim that H = Σ^{1/2} P′ Σ^{-1/2}, which is readily checked by the characterization (6), using the fact that N′ = Σ^{-1/2} N. Therefore, the BLUE ŝ can be expressed in terms of P′ and x′ as

ŝ = Σ^{1/2} P′ x′.  (22)

Finally, replacing all the unknown quantities by their consistent estimators, we obtain an algorithm for approximating the BLUE (see Figure 2).
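To make the overall procedure concrete, here is a minimal Python/NumPy sketch that follows the steps listed in Figure 2 below, using sine index functions h_k(x′) = sin(w_k^⊤ x′) as in the numerical example of Section 4 (the function name ngca_blue and all parameter choices are illustrative, not the authors' implementation):

```python
import numpy as np

def ngca_blue(X, m, L=1000, seed=0):
    """Approximate the BLUE of each row of X (n x d) following the steps of
    Figure 2, where m is the known signal-subspace dimension and
    h_k(x') = sin(w_k^T x') are random index functions."""
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # Whitening: Sigma_hat = (1/n) sum_i x_i x_i^T,  x'_i = Sigma_hat^{-1/2} x_i.
    Sigma = X.T @ X / n
    evals, evecs = np.linalg.eigh(Sigma)
    Sigma_sqrt = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
    Sigma_inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    Xw = X @ Sigma_inv_sqrt                      # whitened samples x'_i as rows

    # Accumulate sum_k beta_k beta_k^T / N_k over L random index functions.
    W = rng.standard_normal((L, d))              # w_k for h_k(x') = sin(w_k^T x')
    C = np.zeros((d, d))
    for w in W:
        hw = np.sin(Xw @ w)                      # h_k(x'_i)
        G = Xw * hw[:, None] - np.cos(Xw @ w)[:, None] * w   # g_k(x'_i), Eq.(13)
        beta = G.mean(axis=0)                    # beta_hat_k, Eq.(14)
        Nk = (G ** 2).sum(axis=1).mean() - beta @ beta       # N_hat_k, Eq.(15)
        C += np.outer(beta, beta) / Nk

    # m leading eigenvectors -> projector P'_hat onto the estimated subspace S'.
    eigvals, eigvecs = np.linalg.eigh(C)
    Psi = eigvecs[:, -m:]
    P_prime = Psi @ Psi.T

    # Denoised signals: s_hat_i = Sigma_hat^{1/2} P'_hat x'_i, Eq.(22).
    return Xw @ P_prime @ Sigma_sqrt

# Usage along the lines of the Section 4 setup: non-Gaussian signal on the
# first axis, correlated Gaussian noise.
rng = np.random.default_rng(1)
n = 500
S_true = np.column_stack([rng.uniform(0, 10, n), np.zeros(n)])
noise = rng.multivariate_normal([0, 0], [[1, 1], [1, 2]], size=n)
S_hat = ngca_blue(S_true + noise, m=1)
print(np.mean(np.sum((S_hat - S_true) ** 2, axis=1)))   # mean squared error
```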

Prepare L different functions {h_k(x′)}_{k=1}^L.
x̂′_i = Σ̂^{-1/2} x_i,  where Σ̂ = (1/n) ∑_{i=1}^n x_i x_i^⊤.
β̂_k = (1/n) ∑_{i=1}^n g_k(x̂′_i),  where g_k(x′) = x′ h_k(x′) − ∇h_k(x′).
N̂_k = (1/n) ∑_{i=1}^n ‖g_k(x̂′_i)‖^2 − ‖β̂_k‖^2.
{ψ̂_i}_{i=1}^m: the m leading eigenvectors of ∑_{k=1}^L β̂_k β̂_k^⊤ / N̂_k.
P̂′ = ∑_{i=1}^m ψ̂_i ψ̂_i^⊤.
ŝ_i = Σ̂^{1/2} P̂′ x̂′_i.

Figure 2: Algorithm for approximating the BLUE.

4 Numerical example

We created two-dimensional samples, where S is spanned by (1, 0)^⊤, the first element of s follows the uniform distribution on (0, 10), the second element of s is zero, and

Q = [ 1  1 ]
    [ 1  2 ].

We used h_i(x) = sin(w_i^⊤ x) (i = 1, 2, ..., 1000) in NGCA, where each w_i is drawn from the centered normal distribution with identity covariance matrix. The samples in the original space, the transformed samples in the whitened space, and the β̂ vectors used in NGCA are plotted in Figure 3. The mean squared errors of {ŝ_i}_{i=1}^n obtained by the true BLUE (using the knowledge of Q and S) and by the proposed NGCA-based method are depicted in Figure 4 as a function of the number of samples n. This illustrates the usefulness of our NGCA-based approach.

5 Conclusions

Computing the BLUE requires prior knowledge of the noise covariance matrix Q and of the signal subspace S. In this paper, we proposed an algorithm for approximating the BLUE without such prior knowledge. Our assumption is that the signal is non-Gaussian and the noise is Gaussian. We described a naive implementation of NGCA for simplicity; more sophisticated implementations of NGCA, such as the multi-index projection pursuit (MIPP) algorithm [2] and iterative metric adaptation for radial kernel functions (IMAK) [3], could be employed and would further improve the accuracy.

Acknowledgments

We acknowledge partial financial support from DFG, BMBF, the EU NOE PASCAL (EU #506778), the Alexander von Humboldt Foundation, MEXT (18300057 and 17700142), and the Tateishi Foundation.

Figure 3: Illustration of the simulation when n = 100: (a) samples in the original space, (b) samples in the whitened space, (c) β̂ used in NGCA. We used 1000 β functions, but plotted only 100 points for clear visibility.

Figure 4: Mean squared error ∑_{i=1}^n ‖ŝ_i − s_i‖^2 / n averaged over 100 runs, as a function of the number of samples n (true BLUE vs. NGCA-based BLUE).

References

[1] A. Albert, Regression and the Moore-Penrose Pseudoinverse, Academic Press, New York and London, 1972.

[2] G. Blanchard, M. Kawanabe, M. Sugiyama, V. Spokoiny, and K.-R. Müller, "In search of non-Gaussian components of a high-dimensional distribution," Journal of Machine Learning Research, vol. 7, pp. 247-282, Feb. 2006.

[3] M. Kawanabe, M. Sugiyama, G. Blanchard, and K.-R. Müller, "A new algorithm of non-Gaussian component analysis with radial kernel functions," Annals of the Institute of Statistical Mathematics, vol. 59, no. 1, pp. 57-75, 2007.

[4] H. Ogawa and N.E. Berrached, "EPBOBs (extended pseudo biorthogonal bases) for signal recovery," IEICE Transactions on Information and Systems, vol. E83-D, no. 2, pp. 223-232, 2000.