From Histograms to Multivariate Polynomial Histograms and Shape Estimation. Assoc Prof Inge Koch


1 From Histograms to Multivariate Polynomial Histograms and Shape Estimation
Assoc Prof Inge Koch
Statistics, School of Mathematical Sciences, University of Adelaide
Inge Koch (UNSW, Adelaide), Poly Histograms, 19 March 2009

2 Motivation: determine the shape of data
We have 12 measurements on each of 27,994 blood cells.
How many clusters are there? How big are they, and where are they?
Data: Centre for Immunology, St Vincent Hospital, Sydney.
Immunologists want to differentiate healthy individuals from those with HIV+.

3 Look at the (Log-Data)
[Figure: plots of the log-transformed blood cell data; panels titled by number of blood cells; axes CD3, CD4 and CD8]

4 Histograms of the (Log-Data)
[Figure: two histograms of CD3, each with a different number of bins]

5 Histograms of the (Log-Data)

6 How Many Clusters Are in the Data?
One-dimensional data: 1 or 2 modes.
Two-dimensional data: 1 to 3 or 4 modes.
How many clusters are in the 12-dimensional data?
If the measurements were independent, the number of modes would be the product of the one-dimensional mode counts, but this is not the case in our data.
Can you think of a 3D example with $k$ modes such that the 2D projections have $k - 1$ modes?

7 Polynomial Histogram Estimators
Main idea: histograms have flat tops, so instead of estimating only the number of points in each bin, estimate the shape separately in each bin.

8 What are Polynomial Histogram Estimators?
Number of observations $n$, dimension $d$, binwidth $h$.
$B_l$: a bin of volume $|B_l| = h^d$ with $n_l$ observations.
The model for each bin $B_l$:
1. histogram estimator (Hist): $f_0(x) = a_0$
2. first-order polynomial histogram estimator (Fophe): $f_1(x) = a_0 + a^T x$
3. second-order polynomial histogram estimator (Sophe): $f_2(x) = a_0 + a^T x + x^T A x$

9 Relationships for Coefficients
In each bin $B_l$ the estimate $\hat f_k$ satisfies
1. proportion of data: $\int_{B_l} \hat f_k(x)\,dx = \frac{n_l}{n}$
2. local mean: $\int_{B_l} x\,\hat f_k(x)\,dx = \frac{n_l}{n}\,\bar{x}_l$
3. local second moment: $\int_{B_l} x x^T \hat f_k(x)\,dx = \frac{n_l}{n}\,M_l$

10 The New Estimators
In each bin $B_l$ with bin centre $t_l$:
Fophe
$$\hat f_1(x) = \frac{1}{h^{d+2}}\,\frac{n_l}{n}\left[\,h^2 + 12\,(\bar{x}_l - t_l)^T (x - t_l)\right]$$
Sophe
$$\hat f_2(x) = \frac{1}{h^{d+4}}\,\frac{n_l}{n}\left\{\frac{4+5d}{4}\,h^4 - 15h^2\,\mathrm{tr}(S_l) + 12h^2\,(x - t_l)^T(\bar{x}_l - t_l) + (x - t_l)^T\left[\,72 S_l + 108\,\mathrm{diag}(S_l) - 15h^2 I\,\right](x - t_l)\right\}$$
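The two estimators on this slide can be evaluated directly from the bin count, the local mean and the local second moment. The following is a minimal sketch rather than the authors' implementation: the function names `bin_moments`, `fophe` and `sophe` are hypothetical, the coefficients used are 12 (Fophe) and 72, 108, 15 (Sophe), and $S_l$ is assumed to be the second moment of the observations about the bin centre $t_l$, which the slide does not define.

```python
import numpy as np

def bin_moments(points, centre):
    """Local mean and second moment of the points in one bin.

    Assumption: S_l is taken about the bin centre t_l.
    """
    dev = points - centre
    xbar = points.mean(axis=0)
    S = dev.T @ dev / len(points)
    return xbar, S

def fophe(x, centre, h, n, n_l, xbar):
    """First-order polynomial histogram estimate at a point x inside the bin."""
    d = len(centre)
    linear = 12.0 * (xbar - centre) @ (x - centre)
    return n_l / (n * h ** (d + 2)) * (h ** 2 + linear)

def sophe(x, centre, h, n, n_l, xbar, S):
    """Second-order polynomial histogram estimate at a point x inside the bin."""
    d = len(centre)
    u = x - centre
    # Quadratic-form matrix: 72 S + 108 diag(S) - 15 h^2 I
    A = 72.0 * S + 108.0 * np.diag(np.diag(S)) - 15.0 * h ** 2 * np.eye(d)
    const = (4 + 5 * d) / 4 * h ** 4 - 15.0 * h ** 2 * np.trace(S)
    linear = 12.0 * h ** 2 * u @ (xbar - centre)
    return n_l / (n * h ** (d + 4)) * (const + linear + u @ A @ u)
```

A quick sanity check on the design: if the local mean equals the bin centre and $S_l = \frac{h^2}{12} I$ (the moments of a uniform distribution on the bin), the linear and quadratic terms vanish and both estimates reduce to the ordinary histogram height $n_l / (n h^d)$.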

11 Roederer Data: 1, observations, CD4 & CD8

12 The performance of estimators
We assess the performance of estimators with the mean squared error (MSE). Let $\hat\theta$ be an estimator for a true quantity $\theta$. Then
$$\mathrm{MSE}(\hat\theta) = [\mathrm{bias}(\hat\theta)]^2 + \mathrm{var}(\hat\theta)$$
$$\mathrm{bias}(\hat\theta) = E\hat\theta - \theta$$
$$\mathrm{var}(\hat\theta) = E\left\{[\hat\theta - E\hat\theta]^2\right\} = E[\hat\theta^2] - [E\hat\theta]^2$$
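The decomposition of the MSE into squared bias plus variance can be checked numerically. The sketch below is an illustration only: the shrunken sample mean $0.9\,\bar{X}$, the sample sizes and the helper name `mse_decomposition` are assumptions chosen for the example and do not come from the talk.

```python
import numpy as np

def mse_decomposition(estimates, theta):
    """Empirical MSE, bias and variance of repeated estimates of theta."""
    estimates = np.asarray(estimates, dtype=float)
    bias = estimates.mean() - theta
    var = estimates.var()                      # divides by N (ddof=0)
    mse = np.mean((estimates - theta) ** 2)
    return mse, bias, var

rng = np.random.default_rng(0)
theta = 2.0
# 5000 replications, each using 20 observations from N(theta, 1)
samples = rng.normal(theta, 1.0, size=(5000, 20))
estimates = 0.9 * samples.mean(axis=1)         # deliberately biased estimator
mse, bias, var = mse_decomposition(estimates, theta)
print(mse, bias ** 2 + var)                    # the two values agree
```

With the variance computed using divisor N, the identity holds exactly for the empirical quantities, not just in expectation.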

13 Sophe's Performance
For a fixed point $x \in B_l$ we want the bias of $\hat f = \hat f_2$ at $x$. Consider
$$E[\hat f(x)] = E\left(\frac{1}{h^{d+4}}\,\frac{n_l}{n}\left\{\frac{4+5d}{4}\,h^4 - 15h^2\,\mathrm{tr}(S_l) + 12h^2\,(x - t_l)^T(\bar{x}_l - t_l) + (x - t_l)^T\left[\,72 S_l + 108\,\mathrm{diag}(S_l) - 15h^2 I\,\right](x - t_l)\right\}\right)$$

14 Some Expectation Calculations I
We show that
$$E\left[\frac{n_l}{n}(\bar{x}_l - t_l)\right] = \int_{B_l} (y - t_l)\,f(y)\,dy$$
and so
$$E\left[\frac{12h^2}{h^{d+4}}\,(x - t_l)^T\,\frac{n_l}{n}(\bar{x}_l - t_l)\right] = \frac{12}{h^{d+2}}\,(x - t_l)^T \int_{B_l} (y - t_l)\,f(y)\,dy,$$
then use a Taylor expansion of $f$ about the bin centre $t_l$:
$$f(y) = f(t_l) + (y - t_l)\,Df(t_l) + \frac{1}{2}(y - t_l)^2 D^2 f(t_l) + \frac{1}{6}(y - t_l)^3 D^3 f(t_l) + o\left(\|y - t_l\|^3\right)$$

15 Some Expectation Calculations II
The first non-zero integral gives
$$E\left[\frac{12}{h^{d+2}}\,(x - t_l)^T\,\frac{n_l}{n}(\bar{x}_l - t_l)\right] \approx (x - t_l)^T Df(t_l).$$
We prove similar results for all terms contributing to $E[\hat f(x)]$ ... and finally get
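The leading-order claim on this slide can be illustrated numerically in one dimension, where the scaled first moment $\frac{12}{h^{3}}\int_{B_l}(y - t_l) f(y)\,dy$ should be close to $f'(t_l)$ for small $h$. The sketch below assumes a standard normal density for $f$ and uses Gauss-Legendre quadrature; the density, the function names and the values of $t_l$ and $h$ are all choices made for the example.

```python
import numpy as np

def normal_pdf(y):
    """Standard normal density, used here as the 'true' f."""
    return np.exp(-0.5 * y ** 2) / np.sqrt(2.0 * np.pi)

def scaled_first_moment(f, t, h, nodes=20):
    """(12 / h^3) * integral over [t - h/2, t + h/2] of (y - t) f(y) dy, for d = 1."""
    z, w = np.polynomial.legendre.leggauss(nodes)   # nodes and weights on [-1, 1]
    y = t + 0.5 * h * z                             # map nodes into the bin
    integral = 0.5 * h * np.sum(w * (y - t) * f(y))
    return 12.0 / h ** 3 * integral

t = 0.5
exact_derivative = -t * normal_pdf(t)               # f'(t) for the normal density
for h in (0.2, 0.1, 0.05):
    approx = scaled_first_moment(normal_pdf, t, h)
    print(h, approx, abs(approx - exact_derivative))
```

The printed error shrinks roughly like $h^2$ as the bin narrows, consistent with the next term of the Taylor expansion involving third derivatives.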

16 The Bias
$$E[\hat f(x)] = f(t_l) + (x - t_l)^T Df(t_l) + \frac{1}{2}(x - t_l)^2 D^2 f(t_l) + \frac{h^2}{12}\,(x - t_l)^T\left(\frac{1}{2}\sum_i f_{uii} - \frac{1}{5} f_{uuu}\right) + o(h^3)$$
Taylor expansion of $f$ about the bin centre $t_l$:
$$f(x) = f(t_l) + (x - t_l)\,Df(t_l) + \frac{1}{2}(x - t_l)^2 D^2 f(t_l) + \frac{1}{6}(x - t_l)^3 D^3 f(t_l) + o\left(\|x - t_l\|^3\right)$$
so $\mathrm{bias}[\hat f(x)]$ depends on a difference of third-order derivatives.
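Subtracting the Taylor expansion of $f(x)$ from the expression for $E[\hat f(x)]$ makes the final remark explicit. The worked step below is a sketch that uses only the two expansions on this slide; the one-dimensional specialisation, with $u = x - t_l$, is added for illustration.

```latex
\begin{align*}
\mathrm{bias}[\hat f(x)] &= E[\hat f(x)] - f(x) \\
 &= \frac{h^2}{12}\,(x - t_l)^T\Bigl(\tfrac{1}{2}\sum_i f_{uii} - \tfrac{1}{5} f_{uuu}\Bigr)
    - \frac{1}{6}\,(x - t_l)^3 D^3 f(t_l) + o(h^3).
\end{align*}
% One-dimensional case, u = x - t_l with |u| <= h/2:
\[
\mathrm{bias}[\hat f(x)] = f'''(t_l)\,u\,\Bigl(\frac{h^2}{40} - \frac{u^2}{6}\Bigr) + o(h^3).
\]
```

The constant, linear and quadratic terms cancel, and because $\|x - t_l\| \le \sqrt{d}\,h/2$ inside the bin, the remaining terms are of order $h^3$, so the squared bias is of order $h^6$.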

17 Moving on ... and making some big leaps
We have the following steps in the performance calculations:
1. pointwise bias and variance, giving the MSE of $\hat f(x)$
2. integrated squared bias and integrated variance of $\hat f$ over all $x$
3. finally some asymptotics when $n \to \infty$
We want to know how Fophe and Sophe depend on the sample size $n$, the binwidth $h$, and the dimension $d$.

18 How Good are Fophe and Sophe?
Estimator   Bias$^2$      Variance                       Rate of convergence
hist        $C_H h^2$     $\frac{1}{nh^d}$               $n^{-2/(d+2)}$
kernel      $C_K h^4$     $\frac{R(K)}{nh^d}$            $n^{-4/(d+4)}$
fophe       $C_F h^4$     $\frac{d+1}{nh^d}$             $n^{-4/(d+4)}$
sophe       $C_S h^6$     $\frac{(d+1)(d+2)}{2nh^d}$     $n^{-6/(d+6)}$
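The rates in the last column follow from balancing a squared bias of order $h^p$ against a variance of order $1/(nh^d)$. A minimal sketch of that calculation follows, assuming a criterion of the form $C h^p + V/(nh^d)$; the constants `C` and `V` and all function names are placeholders for the example, not quantities from the talk.

```python
from fractions import Fraction

def amise(h, n, d, p, C=1.0, V=1.0):
    """Squared-bias term C h^p plus variance term V / (n h^d)."""
    return C * h ** p + V / (n * h ** d)

def optimal_binwidth(n, d, p, C=1.0, V=1.0):
    """Minimiser of amise in h: set p C h^(p-1) = d V / (n h^(d+1))."""
    return (d * V / (p * C * n)) ** (1.0 / (d + p))

def convergence_exponent(p, d):
    """Exponent r such that the minimised criterion is proportional to n^(-r)."""
    return Fraction(p, d + p)

# Squared-bias orders from the table: hist h^2, fophe h^4, sophe h^6
for name, p in (("hist", 2), ("fophe", 4), ("sophe", 6)):
    print(name, convergence_exponent(p, 12))
```

For the 12-dimensional blood cell data this gives exponents 1/7 for the histogram, 1/4 for Fophe and 1/3 for Sophe, which shows how much the higher-order estimators gain as the dimension grows.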

19 Performance for 2, 1 and 1 Observations
[Figure: performance of the hist, Fophe, Sophe and kernel estimators, one panel per sample size]

20 27,994 obs: Kernel est. takes 92 Sophe

21 Advantages of Fophe and Sophe
Computational advantages:
1. a smaller number of bins is required
2. the number of bins only needs to be approximately correct
Sophe is better than Fophe in visual and computational aspects: use Sophe for data.

22 Finding Modes with the Sophe
1. Fix the binwidth $h$, the number of bins $\nu_{\mathrm{bin}}$, and the thresholds $\theta$ and $\kappa$.
2. Find bins with high density.
   2.1 Find $n_l$ in each bin, and discard bins that contain fewer than $\theta$ observations. Let $\mathcal{B} = \{B_l : n_l > \theta\}$.
   2.2 Sort the bins in $\mathcal{B}$ by number of observations, starting with the largest.
3. Determine modes from $\mathcal{B}$ using (1) or (2) below.
   (1) For $i, j = 1, \ldots, \kappa$ calculate the pairwise distances $\Delta_{(i,j)}$ between the bin centres. For $i$ consider the set of nearest neighbours $nn_{(i)} = \{(\Delta_{(i,j)}, n_{(j)}) : \Delta_{(i,j)} \le h\}$. $B_{(i)}$ contains a mode if $n_{(i)}$ is the maximum over $nn_{(i)}$.
   (2) If the matrix $A_{(j)}$ is negative definite, then $B_{(j)}$ contains a mode.
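The steps on this slide can be sketched in code. The version below is an illustration under stated assumptions, not the authors' implementation: the grid starts at the coordinate-wise minimum of the data, distances between bin centres are Euclidean, two bins are neighbours when their centres are at most $h$ apart, $\kappa$ is read as the number of highest-count bins examined, and tied neighbouring bins are all reported. The names `find_mode_bins` and `is_negative_definite` are hypothetical.

```python
import numpy as np

def find_mode_bins(data, h, theta, kappa):
    """Return centres of bins flagged as modes by the neighbour rule (1)."""
    data = np.asarray(data, dtype=float)
    origin = data.min(axis=0)                          # assumed grid origin
    idx = np.floor((data - origin) / h).astype(int)    # bin index of each point
    bins, counts = np.unique(idx, axis=0, return_counts=True)
    keep = counts > theta                              # step 2.1: B = {B_l : n_l > theta}
    bins, counts = bins[keep], counts[keep]
    order = np.argsort(-counts, kind="stable")[:kappa] # step 2.2: largest first
    bins, counts = bins[order], counts[order]
    centres = origin + (bins + 0.5) * h
    # Step 3 (1): pairwise distances between the retained bin centres
    dist = np.linalg.norm(centres[:, None, :] - centres[None, :, :], axis=-1)
    modes = []
    for i in range(len(bins)):
        neighbours = dist[i] <= h * (1 + 1e-9)         # includes bin i itself
        if counts[i] >= counts[neighbours].max():
            modes.append(centres[i])
    return np.array(modes)

def is_negative_definite(A):
    """Rule (2): a symmetric matrix is negative definite if all eigenvalues are < 0."""
    return bool(np.all(np.linalg.eigvalsh(A) < 0))

# Small one-dimensional example with two separated groups of points
example = np.array([0.0, 0.5, 1.2, 1.4, 1.5, 1.6, 1.8, 2.5, 2.6,
                    5.5, 6.1, 6.2, 6.3, 6.4, 7.5, 7.6]).reshape(-1, 1)
print(find_mode_bins(example, h=1.0, theta=0, kappa=10))
```

With a Euclidean threshold of $h$, only bins that share a face count as neighbours; a larger threshold such as $\sqrt{d}\,h$ would also include diagonal neighbours, which is a design choice the slide leaves open.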

23 Look at the (Log-Data)
[Figure: plots of the log-transformed blood cell data; panels titled by number of blood cells; axes CD3, CD4 and CD8]

24 Modes for 12-Dimensional Data
Use 5 bins in each variable; compare the number of modes and the percentage of non-empty bins.
[Table: columns # variables, # modes, # of bins, % non-empty; rows beginning "CDs 3,4,", "CDs 14, 19," and "all"]

25 The End
J Jing, I Koch and K Naito (2009). Polynomial Histograms for Multivariate Density and Mode Estimation. Preprint.
Thank you
