Second-Order Inference for Gaussian Random Curves
1 Second-Order Inference for Gaussian Random Curves With Application to DNA Minicircles
Victor Panaretos, David Kraus, John Maddocks
École Polytechnique Fédérale de Lausanne
Panaretos, Kraus, Maddocks (EPFL) Second-Order Inference for Gaussian Random Curves 1 / 48
2 Are they different?
3 Outline
1 DNA minicircles
2 Functional Data Analysis background
3 Testing procedures
4 Analysis of DNA minicircles
5 Summary
4 Outline
1 DNA minicircles
2 Functional Data Analysis background
3 Testing procedures
4 Analysis of DNA minicircles
5 Summary
5 Geometry of DNA molecules, base-pair sequences
Goal: study the mechanical properties of DNA molecules (via their geometry)
Question: influence of the base-pair sequence
158 base pairs; an 18 bp fragment differs (TATA vs CAP)
6 Electron microscope images
50 nm layer of ice at −170 °C, images tilted ±15°
Minicircle diameter ≈ 17 nm
7 Original data
[figure: raw minicircle curves shown in xy, yz and zx projections]
Each curve: 200 xyz-coordinates
Curves not directly comparable
Adjustment: centering (center of mass = 0), scaling (length = 1)
Not sufficient; further alignment necessary
8 Registration of functional data
Standard alignment methods: 1 landmark alignment, 2 warping
Standard methods cannot be used:
landmark alignment not applicable (no landmarks available)
warping inappropriate (we don't want to change the shape; a rigid method is needed)
Curves have no beginning/end and no orientation
Rotate each curve to make the curves as close as possible
Global optimization over n = 99 orthogonal transformations would be difficult
Instead, rotate each curve separately
9 Moments of inertia tensor
Consider an object in R³ with distribution of mass µ
For DNA minicircles, µ is the uniform measure supported on the curve
Consider an axis given by a unit vector u ∈ R³ (‖u‖ = 1)
The moment of inertia is defined as
J(u) = ∫_{R³} r²(u, x) µ(dx) = ∫_{R³} ‖(I − uuᵀ)x‖² µ(dx)
(integrated squared distance from the axis given by u)
Interpretation: J(u) measures how difficult it is to rotate the object around the axis u
In matrix form J(u) = uᵀJu, where J = ∫_{R³} (xᵀx I − xxᵀ) µ(dx)
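A minimal numerical sketch of this definition: discretize a curve as points of equal mass and average xᵀx I − xxᵀ over them. The unit circle below is a hypothetical stand-in for a minicircle.

```python
import numpy as np

# Hypothetical discretized closed curve: a unit circle in the xy-plane.
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
curve = np.column_stack([np.cos(t), np.sin(t), np.zeros_like(t)])

def inertia_tensor(points):
    """J = (1/n) sum_i (x_i'x_i I - x_i x_i'); then J(u) = u' J u."""
    n = len(points)
    J = np.zeros((3, 3))
    for x in points:
        J += (x @ x) * np.eye(3) - np.outer(x, x)
    return J / n

J = inertia_tensor(curve)
```

For this planar circle J ≈ diag(0.5, 0.5, 1): rotating about the axis perpendicular to the plane (the z-axis) has the largest moment, matching the interpretation on the slide.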
10 Principal axes of inertia
J is positive semidefinite, hence it possesses nonnegative eigenvalues and orthonormal eigenvectors
The first eigenvector w₁ determines the axis around which the curve is most difficult to rotate (J(u) = uᵀJu is maximized at u = w₁)
The projection of the curve on the plane orthogonal to w₁ is the most spread out
The second eigenvector w₂ determines the axis within the first principal plane around which the projected curve is most difficult to rotate
Within the first principal plane, the projection on the line orthogonal to w₂ is the most spread out
The axes given by w₁, w₂, w₃ are called the principal axes of inertia (PAI1, PAI2, PAI3)
PAI3 carries the most spatial information; PAI1 contains the smallest amount of information
11 Moments of inertia alignment
Each curve is aligned separately (no averaging over the sample)
For each curve, the principal axes of inertia are determined and the curve is rotated so that the PAIs agree with the axes of the coordinate system (i.e., the curves are projected on the PAIs)
The procedure is similar to the balancing of a tyre (adjusting the distribution of mass of a wheel so that its PAI is aligned with the axle)
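The per-curve alignment can be sketched as follows: center the curve, eigendecompose its inertia tensor, and express the points in the eigenbasis. The tilted planar ellipse is a hypothetical stand-in for one minicircle.

```python
import numpy as np

# Hypothetical curve: a planar ellipse tilted about the y-axis.
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
ellipse = np.column_stack([2 * np.cos(t), np.sin(t), np.zeros_like(t)])
theta = 0.7  # arbitrary tilt angle
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
curve = ellipse @ R.T

def align_to_pai(points):
    """Center the curve, eigendecompose its inertia tensor and return the
    coordinates of the points along the principal axes of inertia."""
    pts = points - points.mean(axis=0)
    J = sum((x @ x) * np.eye(3) - np.outer(x, x) for x in pts) / len(pts)
    _, W = np.linalg.eigh(J)  # columns: eigenvectors, ascending eigenvalues
    return pts @ W

aligned = align_to_pai(curve)
# After alignment the inertia tensor of `aligned` is diagonal.
```

Each curve is treated on its own, exactly as the slide describes; the only choice is the ordering/sign convention of the eigenvectors, which here follows `numpy.linalg.eigh` (ascending eigenvalues).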
12 Aligned DNA minicircles
[figure: aligned TATA and CAP minicircles shown in PAI1/PAI2/PAI3 coordinates]
13 From curves to functions
Curves have no starting point → take the intersection of the projection on the first principal plane with the positive horizontal (PAI2) semi-axis
Curves have no orientation → traverse counterclockwise in the first principal plane
No correspondence between points on the curves → parametrization by arc length
14 PAI coordinates of aligned DNA minicircles
[figure: PAI3, PAI2 and PAI1 coordinate functions for the TATA and CAP groups]
15 Outline
1 DNA minicircles
2 Functional Data Analysis background
3 Testing procedures
4 Analysis of DNA minicircles
5 Summary
16 Functional data
Each minicircle curve is modelled as the realisation of a stochastic process indexed by [0, 1], taking values in R³: X = {X(t), t ∈ [0, 1]}
X is seen as a random element of the Hilbert space L²[0, 1] of coordinate-wise square-integrable R³-valued functions with the inner product ⟨f, g⟩ = ∫₀¹ ⟨f(t), g(t)⟩ dt
Wlog assume that the mean function µ(t) = E X(t) is zero
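In computations the inner product above is approximated by a quadrature rule; a minimal sketch with a midpoint Riemann sum (the test functions are hypothetical):

```python
import numpy as np

m = 500
t = (np.arange(m) + 0.5) / m  # midpoint grid on [0, 1]

def inner(f, g):
    """<f, g> = integral over [0,1] of <f(t), g(t)> dt, for f, g sampled
    as (m, 3) arrays on the midpoint grid (midpoint rule)."""
    return (f * g).sum(axis=1).mean()

f = np.column_stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t), t])
g = np.column_stack([np.sin(2 * np.pi * t), np.zeros(m), np.ones(m)])
ip = inner(f, g)  # = int sin^2 dt + 0 + int t dt = 0.5 + 0.5 = 1
```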
17 Covariance operator
Denote the covariance function (kernel) R(s, t) = cov(X(s), X(t)) = E(X(s)X(t)ᵀ)
The covariance operator R : L²[0, 1] → L²[0, 1] is defined as
R(f) = E(⟨X, f⟩ X) = ∫₀¹ R(·, t) f(t) dt
Equivalently, R = E(X ⊗ X)
The tensor product of a, b ∈ L²[0, 1] is defined as the operator (a ⊗ b) : L²[0, 1] → L²[0, 1], (a ⊗ b)(f) = a(·) ∫₀¹ ⟨b(t), f(t)⟩ dt
Multivariate analog: a ⊗ b = abᵀ for a, b ∈ Rᵖ
18 Karhunen–Loève decomposition
The covariance kernel admits the representation
R(s, t) = Σ_{k=1}^∞ λ_k φ_k(s) φ_k(t)ᵀ,
where λ_k ≥ 0 are nonincreasing eigenvalues and φ_k orthonormal eigenfunctions of R, i.e., R(φ_k) = λ_k φ_k
The process X can be represented as
X(t) = Σ_{k=1}^∞ ⟨X, φ_k⟩ φ_k(t) = Σ_{k=1}^∞ λ_k^{1/2} ξ_k φ_k(t),
where the Fourier coefficients ξ_k = λ_k^{−1/2} ⟨X, φ_k⟩ are uncorrelated random variables with zero mean and unit variance
If the process is Gaussian, the scores ξ_k, k ≥ 1, are iid standard Gaussian
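A small numerical illustration of the eigen-representation: build a kernel R(s, t) = Σ λ_k φ_k(s) φ_k(t) from three hypothetical eigenpairs (Fourier functions, scalar-valued for simplicity) and recover the λ_k by discretizing the eigenproblem.

```python
import numpy as np

m = 200
t = (np.arange(m) + 0.5) / m   # midpoint grid on [0, 1]
dt = 1.0 / m
lam = np.array([1.0, 0.5, 0.25])          # hypothetical eigenvalues
phi = np.stack([np.sqrt(2) * np.sin(2 * np.pi * t),
                np.sqrt(2) * np.cos(2 * np.pi * t),
                np.sqrt(2) * np.sin(4 * np.pi * t)])
Rst = sum(l * np.outer(p, p) for l, p in zip(lam, phi))

# Discretized eigenproblem R(phi) = lambda * phi: the eigenvalues of
# the matrix R(s,t)*dt approximate the operator eigenvalues.
evals = np.linalg.eigvalsh(Rst * dt)[::-1]
```

The three leading eigenvalues of the discretized operator reproduce λ = (1, 0.5, 0.25) and the rest are (numerically) zero, mirroring the truncated nature of the construction.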
19 Truncated KL decomposition, dimension reduction
The first K eigenelements λ_k, φ_k, k = 1, …, K, provide the optimal rank-K approximation in the sense
min over orthonormal φ₁, …, φ_K of E ‖X − Σ_{k=1}^K ⟨X, φ_k⟩ φ_k‖²
min over orthonormal φ₁, …, φ_K and λ₁, …, λ_K ≥ 0 of E ‖X ⊗ X − Σ_{k=1}^K λ_k (φ_k ⊗ φ_k)‖²_HS
max over orthonormal φ₁, …, φ_K of Σ_{k=1}^K var(⟨X, φ_k⟩)
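In the discretized setting this optimality is the classical Eckart–Young property: the error of keeping the top-K eigenpairs equals the tail eigenvalue energy in the Hilbert–Schmidt (Frobenius) norm. A quick check under hypothetical eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(3)
lam = np.array([4.0, 2.0, 1.0, 0.5, 0.25])   # hypothetical eigenvalues
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # random orthonormal basis
Rmat = Q @ np.diag(lam) @ Q.T                 # covariance with known spectrum

K = 2
evals, evecs = np.linalg.eigh(Rmat)
evals, evecs = evals[::-1], evecs[:, ::-1]    # descending order
RK = (evecs[:, :K] * evals[:K]) @ evecs[:, :K].T  # rank-K truncation
err = np.linalg.norm(Rmat - RK, 'fro')        # = sqrt(sum_{k>K} lam_k^2)
```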
20 Functional Principal Component Analysis
Functional Principal Component Analysis is the empirical version of the Karhunen–Loève decomposition
Empirical covariance operator R̂ = (1/n) Σ_{i=1}^n (X_i − X̄) ⊗ (X_i − X̄)
Functional eigenproblem R̂ φ̂_k = λ̂_k φ̂_k
Usually, observations are represented in a basis, X_i(t) = Σ_j c_ij ψ_j(t)
If the basis {ψ_j} is orthonormal (such as the Fourier basis for periodic data like minicircles), then functional PCA is the usual PCA of the coefficient matrix C = (c_ij)
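The coefficient-matrix reduction can be sketched directly; the sample size and component variances below are hypothetical, not the minicircle data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 99, 10                              # hypothetical sample/basis size
sd = np.array([2.0, 1.0, 0.5] + [0.1] * (p - 3))
C = rng.normal(size=(n, p)) * sd           # basis coefficients c_ij

Cc = C - C.mean(axis=0)                    # center the coefficients
cov = Cc.T @ Cc / n                        # empirical covariance of coeffs
lam_hat, V = np.linalg.eigh(cov)
lam_hat, V = lam_hat[::-1], V[:, ::-1]     # descending eigenvalues
scores = Cc @ V                            # FPCA scores of each curve
```

Because the basis is orthonormal, `lam_hat` are exactly the eigenvalues of the empirical covariance operator and the columns of `V` are the eigenfunctions' coefficient vectors.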
21 Outline
1 DNA minicircles
2 Functional Data Analysis background
3 Testing procedures
4 Analysis of DNA minicircles
5 Summary
22 Means of PAI coordinates
[figure: mean PAI3, PAI2 and PAI1 coordinate functions for the TATA, pooled and CAP samples]
23 Means on the principal plane
[figure: mean projections on the first principal plane for the TATA, pooled and CAP samples]
24 Covariance functions
[figure: covariance functions of the PAI3 coordinate for the TATA and CAP groups]
25 Testing problem
Situation: X₁, …, X_{n₁}, Y₁, …, Y_{n₂} independent samples of Gaussian stochastic processes with means µ_X, µ_Y and covariance operators R_X, R_Y
Mean functions appear to be equal → focus on covariance operators
Hypothesis testing problem H₀: R_X = R_Y vs H₁: R_X ≠ R_Y
Use of a statistic like R̂_X^{−1} R̂_Y is impossible (noninvertibility)
Instead, use the difference R̂_X − R̂_Y, which should be close to the zero operator under the null
26 Hilbert–Schmidt operator norm
Need to measure the distance of R̂_X − R̂_Y from the zero operator → need an operator norm
The Hilbert–Schmidt operator norm is defined as ‖A‖²_HS = Σ_i ‖A e_i‖² = Σ_{i,j} ⟨e_i, A e_j⟩²
(multivariate analog: ‖A‖²_F = Σ_{i,j} a²_ij for a matrix A)
The distribution of ‖R̂_X − R̂_Y‖²_HS is intractable → perform dimension reduction, focus on the projected operators
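On a K-dimensional truncation an operator is just a K×K matrix, and the HS norm becomes the Frobenius norm; a one-function sketch:

```python
import numpy as np

def hs_norm(A):
    """||A||_HS = sqrt(sum_{i,j} <e_i, A e_j>^2), i.e. the Frobenius norm
    of the matrix representing the (truncated) operator."""
    return np.sqrt((A ** 2).sum())

A = np.array([[3.0, 0.0],
              [4.0, 0.0]])
val = hs_norm(A)  # sqrt(9 + 16) = 5
```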
27 Projection, truncation of the HS norm
Let f₁, …, f_K be some orthonormal L² functions and let π_K = Σ_{k=1}^K f_k ⊗ f_k be the projection operator onto the span of f₁, …, f_K
The test will be based on ‖R̂ᴷ_X − R̂ᴷ_Y‖²_HS, where R̂ᴷ_X = π_K R̂_X π_K and R̂ᴷ_Y = π_K R̂_Y π_K
Need a common basis (the same π_K for X and Y), i.e., a common reference coordinate system
We use π_K = π̂_K projecting on φ̂₁^{XY}, …, φ̂_K^{XY} (eigenfunctions of the pooled-sample estimator R̂ = (n₁/n) R̂_X + (n₂/n) R̂_Y)
28 Test statistic
The statistic is
‖R̂ᴷ_X − R̂ᴷ_Y‖²_HS = Σ_{k=1}^K Σ_{j=1}^K ⟨φ̂_k^{XY}, (R̂_X − R̂_Y) φ̂_j^{XY}⟩²
The terms λ̂_kj^{X,XY} = ⟨φ̂_k^{XY}, R̂_X φ̂_j^{XY}⟩ and λ̂_kj^{Y,XY} = ⟨φ̂_k^{XY}, R̂_Y φ̂_j^{XY}⟩ are empirical variances/covariances of the X and Y scores w.r.t. the common basis
Their asymptotic variance under H₀ is 2^{δ_kj} λ_k λ_j
The test statistic based on standardized components is
T = (n₁n₂ / 2n) Σ_{k=1}^K Σ_{j=1}^K (λ̂_kj^{X,XY} − λ̂_kj^{Y,XY})² / (λ̂_k λ̂_j)
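The whole construction can be sketched on basis coefficients. This is a hedged illustration, not the authors' implementation: the samples below are hypothetical Gaussian coefficient vectors in a common orthonormal basis, already centred (means zero, as assumed on the slides).

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, p, K = 60, 60, 8, 3        # hypothetical sizes / truncation level
X = rng.normal(size=(n1, p))       # H0 holds: both covariances = identity
Y = rng.normal(size=(n2, p))

RX = X.T @ X / n1                  # empirical covariance operators
RY = Y.T @ Y / n2
n = n1 + n2
Rpool = (n1 * RX + n2 * RY) / n    # pooled-sample estimator
lam, Phi = np.linalg.eigh(Rpool)
lam, Phi = lam[::-1][:K], Phi[:, ::-1][:, :K]  # top-K pooled eigenpairs

LX = Phi.T @ RX @ Phi              # empirical score covariances (X sample)
LY = Phi.T @ RY @ Phi              # empirical score covariances (Y sample)
T = n1 * n2 / (2 * n) * ((LX - LY) ** 2 / np.outer(lam, lam)).sum()
# Under H0, T is approximately chi-squared with K(K+1)/2 = 6 df.
```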
29 Asymptotics
Under H₀ and the Gaussian assumption, T →_d χ²_{K(K+1)/2} as n → ∞
Sketch of the proof:
Using consistency of φ̂_k^{XY}, replace π̂_K by π_K
By the Hilbert space CLT, R̂_X = (1/n₁) Σ_{i=1}^{n₁} X̃_i with X̃_i = X_i ⊗ X_i is asymptotically a Gaussian random operator (a random element in the space of operators with Gaussian fdds)
Investigation of the covariance operator of the limit (an operator on operators on L²) yields that the components of T are asymptotically independent Gaussian
30 Modifications of the test statistic
Diagonal statistic T₁ = (n₁n₂ / 2n) Σ_{k=1}^K (λ̂_kk^{X,XY} − λ̂_kk^{Y,XY})² / λ̂_k²
(compares only eigenvalues; might be good when the eigenfunctions are equal)
Variance-stabilizing transformations: log of the diagonal (variance) terms, Fisher's z-transformation of the off-diagonal (covariance) terms
31 Selection of K
1 Scree plots, cumulative explained proportion of variance, …
2 Minimization of the penalized fit criterion
PFC(K) = GOF_X(K) + GOF_Y(K) + (n₁/n) PEN_X(K) + (n₂/n) PEN_Y(K)
(no formal result on the post-selection test)
32 Simulations: design
Simulated processes are combinations of Fourier basis functions with independent Gaussian coefficients
Mimicking the "elbow effect": 3 or 4 dominating components and several components with smaller variance
n₁ = n₂ = 50, nominal level 5 %
33 Simulations: level
Scenario A: equal covariance operators
[table: empirical levels of the off-diagonal and diagonal tests for several values of K and for K̂]
The tests maintain the nominal level when K does not exceed the rank
This is true also for K̂. The selection criterion aims at estimating the effective complexity of the distributions; it does not optimize the power and does not reflect validity or invalidity of H₀.
34 Simulations: power
Scenario B: the same sequence of eigenfunctions (in the same order); the first eigenvalues differ
[table: empirical power of the off-diagonal and diagonal tests for several values of K and for K̂]
The power decreases with K (the difference is only in the first component)
The diagonal test is better (the same eigenfunctions)
35 Simulations: power
Scenario E: the same set of eigenfunctions in a different order (permuted); the first eigenfunctions equal; the same eigenvalues
[table: empirical power of the off-diagonal and diagonal tests for several values of K and for K̂]
No power with K = 1 (equal first eigenelements)
Lower power for the diagonal test (the same eigenvalues)
36 Simulations: power
Scenario F: completely different eigenfunctions (sines vs time-shifted sines); equal eigenvalues
[table: empirical power of the off-diagonal and diagonal tests for several values of K and for K̂]
Lower power for the diagonal test
37 Outline
1 DNA minicircles
2 Functional Data Analysis background
3 Testing procedures
4 Analysis of DNA minicircles
5 Summary
38 Outliers?
[figure: PAI coordinate functions and principal-plane projections for the TATA and CAP groups, with candidate outliers]
39 Spatial median, outlier detection
The functional spatial median is defined as the solution to
min_{m ∈ L²} Σ_{i=1}^n ‖X_i − m‖, or, equivalently, Σ_{i=1}^n (m − X_i)/‖m − X_i‖ = 0
The solution m̂ can be written as the weighted sum m̂ = Σ_{i=1}^n w_i X_i, with w_i ≥ 0 and Σ_{i=1}^n w_i = 1
Outliers have small weights w_i; large values of 1/w_i indicate outliers
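The fixed-point characterization suggests a Weiszfeld-type iteration: repeatedly set m to the weighted mean with weights inversely proportional to distance. This is a hedged sketch on hypothetical discretized curves, not the authors' code; the iteration also exposes the weights w_i used for outlier flagging.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 50
curves = rng.normal(size=(n, p))   # hypothetical discretized curves
curves[0] += 10.0                  # plant one gross outlier

def spatial_median(X, n_iter=200, eps=1e-10):
    """Weiszfeld-type iteration: m <- sum_i w_i X_i with
    w_i proportional to 1/||X_i - m||; returns the median and weights."""
    m = X.mean(axis=0)
    w = np.full(len(X), 1.0 / len(X))
    for _ in range(n_iter):
        d = np.linalg.norm(X - m, axis=1)
        w = 1.0 / np.maximum(d, eps)   # eps guards division by zero
        w /= w.sum()
        m = w @ X
    return m, w

med, w = spatial_median(curves)
# The planted outlier receives the smallest weight (largest 1/w_i).
```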
40 Inverse spatial median weights
[figure: inverse median weights 1/w_i by observation index, for PAI3, PAI2, PAI1 and the joint PAI1,2,3 analysis]
41 Outliers
[figure: PAI coordinate functions and principal-plane projections with the detected outliers highlighted]
42 Joint PCA of PAI2,3: coordinates of eigenfunctions
[figure: PAI2 and PAI3 coordinates of the first four eigenfunctions φ₁, …, φ₄]
43 Joint PCA of PAI2,3: eigencircles
[figure: first four PCA functions ("eigencircles"); percentages of variability 26.1, 22.4, 12.4, 11.0 in one group and 28.1, 20.1, 12.5, 11.6 in the other]
44 Joint PCA of PAI2,3: eigenvalues
[figure: eigenvalues, proportions of variance and cumulative proportions of variance for the joint analysis of PAI2,3]
45 Selection of K (joint analysis of PAI2,3)
Scree plot w.r.t. the common (pooled sample) eigenbasis
[figure: eigenvalues, proportions and cumulative proportions of variance]
Plots suggest K = 6 or 7; automatic choice K = 7
46 Test results
Test statistic: off-diagonal, transformed, χ² approximation
[table: p-values for PAI3, PAI2, PAI1 and the joint PAI2,3 analysis, for K chosen by scree plots (S) and automatically (A)]
47 Outline
1 DNA minicircles
2 Functional Data Analysis background
3 Testing procedures
4 Analysis of DNA minicircles
5 Summary
48 Summary
DNA minicircle data, alignment, …
Functional data approach
Tests based on an approximation of the Hilbert–Schmidt distance between empirical covariance operators; asymptotics, simulations, …
Minicircle data analysis (outlier detection, order selection, testing, …)
Outlook: robustification, normality testing, …
Published version: "Second-Order Comparison of Gaussian Random Functions and the Geometry of DNA Minicircles" (JASA; supplementary materials available online)
More informationMultivariate Regression
Multivariate Regression The so-called supervised learning problem is the following: we want to approximate the random variable Y with an appropriate function of the random variables X 1,..., X p with the
More informationApproximate Kernel PCA with Random Features
Approximate Kernel PCA with Random Features (Computational vs. Statistical Tradeoff) Bharath K. Sriperumbudur Department of Statistics, Pennsylvania State University Journées de Statistique Paris May 28,
More informationCSE 554 Lecture 7: Alignment
CSE 554 Lecture 7: Alignment Fall 2012 CSE554 Alignment Slide 1 Review Fairing (smoothing) Relocating vertices to achieve a smoother appearance Method: centroid averaging Simplification Reducing vertex
More informationVector spaces. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.
Vector spaces DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda Vector space Consists of: A set V A scalar
More informationStatistical signal processing
Statistical signal processing Short overview of the fundamentals Outline Random variables Random processes Stationarity Ergodicity Spectral analysis Random variable and processes Intuition: A random variable
More informationGaussian vectors and central limit theorem
Gaussian vectors and central limit theorem Samy Tindel Purdue University Probability Theory 2 - MA 539 Samy T. Gaussian vectors & CLT Probability Theory 1 / 86 Outline 1 Real Gaussian random variables
More information7. Variable extraction and dimensionality reduction
7. Variable extraction and dimensionality reduction The goal of the variable selection in the preceding chapter was to find least useful variables so that it would be possible to reduce the dimensionality
More informationNumerical Methods I Singular Value Decomposition
Numerical Methods I Singular Value Decomposition Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 October 9th, 2014 A. Donev (Courant Institute)
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen PCA. Tobias Scheffer
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen PCA Tobias Scheffer Overview Principal Component Analysis (PCA) Kernel-PCA Fisher Linear Discriminant Analysis t-sne 2 PCA: Motivation
More information1 Inner Product and Orthogonality
CSCI 4/Fall 6/Vora/GWU/Orthogonality and Norms Inner Product and Orthogonality Definition : The inner product of two vectors x and y, x x x =.., y =. x n y y... y n is denoted x, y : Note that n x, y =
More informationNoise & Data Reduction
Noise & Data Reduction Andreas Wichert - Teóricas andreas.wichert@inesc-id.pt 1 Paired Sample t Test Data Transformation - Overview From Covariance Matrix to PCA and Dimension Reduction Fourier Analysis
More informationModeling Repeated Functional Observations
Modeling Repeated Functional Observations Kehui Chen Department of Statistics, University of Pittsburgh, Hans-Georg Müller Department of Statistics, University of California, Davis Supplemental Material
More informationLearning Eigenfunctions: Links with Spectral Clustering and Kernel PCA
Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA Yoshua Bengio Pascal Vincent Jean-François Paiement University of Montreal April 2, Snowbird Learning 2003 Learning Modal Structures
More informationPrincipal Component Analysis
Principal Component Analysis Laurenz Wiskott Institute for Theoretical Biology Humboldt-University Berlin Invalidenstraße 43 D-10115 Berlin, Germany 11 March 2004 1 Intuition Problem Statement Experimental
More informationMultivariate Statistics (I) 2. Principal Component Analysis (PCA)
Multivariate Statistics (I) 2. Principal Component Analysis (PCA) 2.1 Comprehension of PCA 2.2 Concepts of PCs 2.3 Algebraic derivation of PCs 2.4 Selection and goodness-of-fit of PCs 2.5 Algebraic derivation
More informationEmpirical Gramians and Balanced Truncation for Model Reduction of Nonlinear Systems
Empirical Gramians and Balanced Truncation for Model Reduction of Nonlinear Systems Antoni Ras Departament de Matemàtica Aplicada 4 Universitat Politècnica de Catalunya Lecture goals To review the basic
More informationPolynomial Chaos and Karhunen-Loeve Expansion
Polynomial Chaos and Karhunen-Loeve Expansion 1) Random Variables Consider a system that is modeled by R = M(x, t, X) where X is a random variable. We are interested in determining the probability of the
More informationIV. Matrix Approximation using Least-Squares
IV. Matrix Approximation using Least-Squares The SVD and Matrix Approximation We begin with the following fundamental question. Let A be an M N matrix with rank R. What is the closest matrix to A that
More informationPreprocessing & dimensionality reduction
Introduction to Data Mining Preprocessing & dimensionality reduction CPSC/AMTH 445a/545a Guy Wolf guy.wolf@yale.edu Yale University Fall 2016 CPSC 445 (Guy Wolf) Dimensionality reduction Yale - Fall 2016
More informationPrinciple Components Analysis (PCA) Relationship Between a Linear Combination of Variables and Axes Rotation for PCA
Principle Components Analysis (PCA) Relationship Between a Linear Combination of Variables and Axes Rotation for PCA Principle Components Analysis: Uses one group of variables (we will call this X) In
More informationAlignment and Analysis of Proteomics Data using Square Root Slope Function Framework
Alignment and Analysis of Proteomics Data using Square Root Slope Function Framework J. Derek Tucker 1 1 Department of Statistics Florida State University Tallahassee, FL 32306 CTW: Statistics of Warpings
More informationFunctional principal component analysis of aircraft trajectories
Functional principal component analysis of aircraft trajectories Florence Nicol To cite this version: Florence Nicol. Functional principal component analysis of aircraft trajectories. ISIATM 0, nd International
More informationHigh Dimensional Covariance and Precision Matrix Estimation
High Dimensional Covariance and Precision Matrix Estimation Wei Wang Washington University in St. Louis Thursday 23 rd February, 2017 Wei Wang (Washington University in St. Louis) High Dimensional Covariance
More informationFactor Analysis Continued. Psy 524 Ainsworth
Factor Analysis Continued Psy 524 Ainsworth Equations Extraction Principal Axis Factoring Variables Skiers Cost Lift Depth Powder S1 32 64 65 67 S2 61 37 62 65 S3 59 40 45 43 S4 36 62 34 35 S5 62 46 43
More informationKernel-Based Contrast Functions for Sufficient Dimension Reduction
Kernel-Based Contrast Functions for Sufficient Dimension Reduction Michael I. Jordan Departments of Statistics and EECS University of California, Berkeley Joint work with Kenji Fukumizu and Francis Bach
More informationPrincipal component analysis
Principal component analysis Angela Montanari 1 Introduction Principal component analysis (PCA) is one of the most popular multivariate statistical methods. It was first introduced by Pearson (1901) and
More informationFrank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c /9/9 page 147 le-tex
Frank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c08 2013/9/9 page 147 le-tex 8.3 Principal Component Analysis (PCA) 147 Figure 8.1 Principal and independent components
More informationChapter 3 Transformations
Chapter 3 Transformations An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Linear Transformations A function is called a linear transformation if 1. for every and 2. for every If we fix the bases
More informationMultivariate Statistics
Multivariate Statistics Chapter 2: Multivariate distributions and inference Pedro Galeano Departamento de Estadística Universidad Carlos III de Madrid pedro.galeano@uc3m.es Course 2016/2017 Master in Mathematical
More informationPrincipal Component Analysis CS498
Principal Component Analysis CS498 Today s lecture Adaptive Feature Extraction Principal Component Analysis How, why, when, which A dual goal Find a good representation The features part Reduce redundancy
More informationSTAT 100C: Linear models
STAT 100C: Linear models Arash A. Amini April 27, 2018 1 / 1 Table of Contents 2 / 1 Linear Algebra Review Read 3.1 and 3.2 from text. 1. Fundamental subspace (rank-nullity, etc.) Im(X ) = ker(x T ) R
More informationMachine Learning - MT & 14. PCA and MDS
Machine Learning - MT 2016 13 & 14. PCA and MDS Varun Kanade University of Oxford November 21 & 23, 2016 Announcements Sheet 4 due this Friday by noon Practical 3 this week (continue next week if necessary)
More informationCS 143 Linear Algebra Review
CS 143 Linear Algebra Review Stefan Roth September 29, 2003 Introductory Remarks This review does not aim at mathematical rigor very much, but instead at ease of understanding and conciseness. Please see
More informationConvergence of Eigenspaces in Kernel Principal Component Analysis
Convergence of Eigenspaces in Kernel Principal Component Analysis Shixin Wang Advanced machine learning April 19, 2016 Shixin Wang Convergence of Eigenspaces April 19, 2016 1 / 18 Outline 1 Motivation
More informationPrincipal Component Analysis-I Geog 210C Introduction to Spatial Data Analysis. Chris Funk. Lecture 17
Principal Component Analysis-I Geog 210C Introduction to Spatial Data Analysis Chris Funk Lecture 17 Outline Filters and Rotations Generating co-varying random fields Translating co-varying fields into
More informationIntelligent Data Analysis. Principal Component Analysis. School of Computer Science University of Birmingham
Intelligent Data Analysis Principal Component Analysis Peter Tiňo School of Computer Science University of Birmingham Discovering low-dimensional spatial layout in higher dimensional spaces - 1-D/3-D example
More informationFunctional Latent Feature Models. With Single-Index Interaction
Generalized With Single-Index Interaction Department of Statistics Center for Statistical Bioinformatics Institute for Applied Mathematics and Computational Science Texas A&M University Naisyin Wang and
More informationIntroduction to Machine Learning
10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what
More information1 Principal Components Analysis
Lecture 3 and 4 Sept. 18 and Sept.20-2006 Data Visualization STAT 442 / 890, CM 462 Lecture: Ali Ghodsi 1 Principal Components Analysis Principal components analysis (PCA) is a very popular technique for
More informationData Mining and Analysis: Fundamental Concepts and Algorithms
Data Mining and Analysis: Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki 1 Wagner Meira Jr. 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA
More informationSTATISTICAL SHAPE MODELS (SSM)
STATISTICAL SHAPE MODELS (SSM) Medical Image Analysis Serena Bonaretti serena.bonaretti@istb.unibe.ch ISTB - Institute for Surgical Technology and Biomechanics University of Bern Overview > Introduction
More informationModel Selection and Geometry
Model Selection and Geometry Pascal Massart Université Paris-Sud, Orsay Leipzig, February Purpose of the talk! Concentration of measure plays a fundamental role in the theory of model selection! Model
More information