Research Statement on Statistics
Jun Zhang (junzhang@galton.uchicago.edu)

My interests in statistics broadly include machine learning and statistical genetics. My recent work focuses on detecting and interpreting patterns in high-dimensional, correlated, and noisy data with complex structure, particularly in genomics and medical imaging. Specific projects include:

Population Structure Visualization

Correctly detecting population structure (PS) is critical for inferring human migration history and understanding evolution. Confounding due to population structure in the disease association studies now being planned at a rapid pace, i.e., false discoveries caused by systematic allele frequency differences among subpopulations, makes the issue urgent. The prevailing method for analyzing PS is to use the top principal components (PCs) of the covariance matrix of subjects to summarize global genetic variation across space. In [1, 2], from the point of view of manifold learning, I propose using Laplacian eigenfunctions, instead of PCs, to infer PS. The idea is to construct an adjacency graph in which each node represents one subject and is connected only to its close neighbors, since relationships between weakly correlated subjects usually carry little meaningful information. One can then study the geometry of this intrinsic dependence graph. In particular, the Laplacian eigenfunctions associated with the graph are generalized harmonic functions that contain geometric information about the graph. Compared with PCA, our method is less sensitive to noise and more robust to outliers. Our method, LAPSTRUCT, is expected to become a promising tool for population structure detection and correction in disease association studies. In the collaborative work [3], the proposed approach was successfully demonstrated on the speciation of gulls worldwide using AFLP markers, with results fully consistent with other evolutionary evidence.
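The graph construction described in the Population Structure Visualization section can be illustrated with a minimal sketch. It assumes a symmetrized k-nearest-neighbor graph with 0/1 weights and the unnormalized graph Laplacian; the function name `laplacian_embedding`, the choice k = 2, and the toy one-dimensional "subjects" are hypothetical and are not taken from LAPSTRUCT or from [1, 2].

```python
import numpy as np

def laplacian_embedding(X, k=2, n_components=1):
    """Embed the rows of X using eigenvectors of a kNN graph Laplacian.

    Illustrative assumptions: 0/1 edge weights, symmetrized kNN graph,
    unnormalized Laplacian L = D - W, and no duplicate rows in X.
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances between subjects.
    diff = X[:, None, :] - X[None, :, :]
    d2 = (diff ** 2).sum(axis=2)

    # Connect each subject only to its k closest neighbors.
    W = np.zeros((n, n))
    for i in range(n):
        neighbors = np.argsort(d2[i])[1:k + 1]  # index 0 is the subject itself
        W[i, neighbors] = 1.0
    W = np.maximum(W, W.T)  # keep an edge if either endpoint selected it

    L = np.diag(W.sum(axis=1)) - W
    evals, evecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    # Skip the constant eigenvector that belongs to eigenvalue 0.
    return evals[1:n_components + 1], evecs[:, 1:n_components + 1]

# Toy data: two groups of five "subjects" on a line with a wider gap between them.
positions = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.5, 6.5, 7.5, 8.5, 9.5])
X = positions[:, None]
evals, evecs = laplacian_embedding(X, k=2)
fiedler = evecs[:, 0]
print(np.sign(fiedler))  # the two groups receive opposite signs
```

On this toy example the first nontrivial eigenvector changes sign exactly at the gap, which is the kind of structure a plot of the leading eigenfunctions is meant to reveal. A normalized Laplacian or kernel-weighted edges would be equally reasonable choices.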
Ancestral Informative Marker Selection via Sparse Laplacian Eigenfunctions

Identifying a small panel of markers that are informative about population structure can reduce genotyping cost and is useful in various applications, such as ancestry inference in association mapping, forensics, and evolutionary theory in population genetics. Traditional methods for ascertaining ancestral informative markers usually require prior knowledge of individual ancestry and have difficulty with admixed populations. In [4], I propose a novel approach based on our recent result on summarizing population structure by graph Laplacian eigenfunctions. The approach also takes advantage of the a priori sparseness of informative markers in the genome. Using a simulated ring population and the real global HGDP sample of 650K SNPs genotyped in 940 unrelated individuals, we validate that the proposed algorithm selects the most informative markers, a small fraction of which suffices to recover a similar underlying population structure. Employing a standard Support Vector Machine (SVM) to predict individuals' continental memberships on the HGDP dataset covering seven continents, we demonstrate that the SNPs selected by our method are more informative and less redundant than those selected by traditional methods.

Association of Rare Variants with Phenotype Using Next-Generation Sequencing Data

Identifying associated rare variants is important for understanding the etiology of complex diseases. Studies have shown that a group of rare variants may explain a proportion of the genetic basis of a disease. In [5], we proposed a novel method, motivated by Fisher's combined significance method, that integrates the significance of each rare mutation and tests for association with disease at the level of the functional unit. Our approach is comparable to other prevailing methods in the literature, such as weighted and combined methods. This is illustrated on the 1000 Genomes sequencing data. Our method is computationally fast and is suitable for disease studies at the coming genome-sequencing scale.

Diagnosis of Medical Images

In computerized diagnosis of medical images, such as CT colonography and digital mammography for breast cancer, it is critical to have an efficient algorithm to distinguish malignant lesions from benign ones. One approach is to select a massive number of subregions as features from gold-standard samples and train on them according to their approximated likelihood of being malignant. In [6], we reduced the dimension of the features using manifold-based techniques and trained the new features with a multi-layer Artificial Neural Network (ANN), which reduced the training time significantly while maintaining statistical significance at the same sensitivity level.
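Returning to the rare-variant association test described above: the classical Fisher combination of independent p-values, which motivates the method in [5], can be sketched as follows. This is only the textbook statistic, not the method of [5], and it assumes the per-variant p-values are independent; the function name `fisher_combined_pvalue` and the example p-values are hypothetical.

```python
import math

def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's classical method.

    The statistic X = -2 * sum(log p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null, where k = len(pvalues).
    """
    if not pvalues or any(not 0.0 < p <= 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(pvalues)
    stat = -2.0 * sum(math.log(p) for p in pvalues)

    # For even degrees of freedom 2k the chi-square tail has a closed form:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = stat / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return stat, math.exp(-half) * total

# Hypothetical per-variant p-values within one functional unit (e.g., a gene).
stat, combined = fisher_combined_pvalue([0.05, 0.05])
print(stat, combined)
```

With a single p-value the combined p-value reduces to that p-value itself, which is a convenient sanity check.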
Planned Research in the Near Future

Cancer Subtype Classification via Robust Principal Components Using Copy Number Aberration Data

In [7], we develop a robust principal-components-based approach to cluster cancer tumor subtypes using their DNA copy number profiles. We demonstrate the procedure on a breast cancer dataset whose gene expression profiles are also available to verify the conclusions.

Random Matrix Theory for Dependent Data and Dependence Measures for Nonlinear Time Series

Classical random matrix theory (RMT) gives the limiting distribution of the suitably normalized largest eigenvalue of the sample covariance matrix for i.i.d. Gaussian samples, namely the Tracy-Widom law. However, real data are almost always dependent. A natural question arises: is there a similar asymptotic distribution for dependent samples? The general dependent case seems hopeless. Instead, I am investigating the question under a weak form of dependence such as m-dependence, where the samples could be observations of certain simple stationary time series.

Estimation of Integrated Volatility Using Lévy Processes

My interest lies in using high-frequency data to estimate the integrated volatility over a given time period. Let $\{S_t\}$ denote the price process of a security and suppose the log-return process $\{X_t = \log S_t\}$ follows an Itô process $dX_t = \mu_t\,dt + \sigma_t\,dL_t$, where $L_t$ is a stable Lévy process with index $\alpha$. We also incorporate observation error into the estimating procedure through $Y_{t_i} = X_{t_i} + \epsilon_{t_i}$, where $X_t$ is the latent true return process and the $\epsilon_{t_i}$ are independent noise. In the presence of market microstructure noise, it is known that sparse subsampling can reduce the variance of quadratic-variation-based estimators when the return process is driven by Brownian motion, that is, $\alpha = 2$. I investigate similar estimators in the general case $0 < \alpha < 2$ and demonstrate the advantages of the additional variability of Lévy processes, which allow jumps. This is joint work with Wei-biao Wu.

Statistical Manifold Learning

Manifold methods have become increasingly important and popular in machine learning and have seen numerous recent applications in data analysis, including dimensionality reduction, visualization, clustering, and classification. The central modeling assumption in all of these methods is that the data reside on or near a low-dimensional submanifold of a higher-dimensional space. However, one does not have access to the underlying manifold and instead approximates it from a point cloud, usually by constructing an adjacency graph.
The underlying intuition has always been that, since the graph is a proxy for the manifold, inference based on the structure of the graph corresponds to the desired inference based on the geometric structure of the manifold. We are exploring theoretical results to justify this intuition. To be precise, Niyogi earlier introduced a framework based on the Laplace-Beltrami operator on a manifold to motivate the use of the graph Laplacian associated with point-cloud data, namely the Laplacian Eigenmap. Assuming $M$ is a compact Riemannian submanifold of $\mathbb{R}^n$, the operator $\Delta_M$ is defined as
\[ \Delta_M f = -\operatorname{div}(\nabla f), \quad f \in C^2(M). \]
The eigenfunctions of the Laplacian form a basis for $L^2(M)$ and play a central role in a variety of algorithms for data analysis. If the manifold is equipped with a measure $\nu$, given by $d\nu(x) = P(x)\,d\mu(x)$ for some density function $P(x)$, with $d\mu$ the canonical measure corresponding to the volume form, then the weighted Laplacian is defined as
\[ \Delta_{M,\nu} f(x) = -\frac{1}{P(x)} \operatorname{div}\bigl(P(x)\,\nabla_M f\bigr). \]
Given data points $\{x_1, \ldots, x_n\}$ sampled i.i.d. from an arbitrary distribution $P$ on $M$, we construct a weighted graph associated with the point cloud using a Gaussian kernel. We define the point-cloud Laplace operator by
\[ L_n^t f(x) = f(x)\,\frac{1}{n}\sum_j e^{-\frac{\|x - x_j\|^2}{4t}} - \frac{1}{n}\sum_j f(x_j)\, e^{-\frac{\|x - x_j\|^2}{4t}}. \]
We justify the following: let $t_n = n^{-\frac{1}{k+2+\alpha}}$, where $\alpha > 0$ and $k$ is the dimension of $M$, and let $f \in C^\infty(M)$. Then the following limit holds:
\[ \lim_{n \to \infty} \frac{1}{t_n (4\pi t_n)^{k/2}}\, L_n^{t_n} f(x) = \frac{1}{\operatorname{vol}(M)}\, \Delta_{M,\nu} f(x). \]

Gene Regulatory Networks and Graphical Models

Identifying variations in DNA that increase susceptibility to disease is one of the primary aims of genetic studies that use forward genetics approaches such as linkage and association testing. However, such studies provide limited functional information on how genes lead to disease. An alternative is to identify gene networks that are perturbed by susceptibility loci and that in turn lead to disease. Bayesian networks have recently been employed as a tool to infer interactions between genes. A Bayesian network is a graphical model of a joint multivariate probability distribution that captures conditional independence properties between variables. Given various genomic data such as genotypes, expression profiles, copy number aberrations, and sequencing data, I am interested in developing graphical models that can better learn transcriptional regulatory networks and infer causal relations from noisy data. A gene regulatory network is also a dynamic network. With additional time-course gene expression data, it will be very valuable to bring tools from time series analysis into the network framework. Another closely related direction is to develop graphical-model-based tools that incorporate known biological pathway knowledge into association studies.
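To make the conditional independence property of a Bayesian network concrete, here is a minimal sketch on a three-node chain, genotype G -> expression E -> disease D. The network structure, the variable names, and all probability values are invented purely for illustration and do not come from any dataset or from the work described here.

```python
from itertools import product

# Hypothetical conditional probability tables for binary variables.
p_g = {0: 0.7, 1: 0.3}                                   # P(G)
p_e_given_g = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}  # P(E | G), indexed [g][e]
p_d_given_e = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}  # P(D | E), indexed [e][d]

def joint(g, e, d):
    """Joint probability given by the network factorization P(G) P(E|G) P(D|E)."""
    return p_g[g] * p_e_given_g[g][e] * p_d_given_e[e][d]

def conditional_d(d, e, g):
    """P(D = d | E = e, G = g), computed from the joint distribution."""
    denominator = sum(joint(g, e, dd) for dd in (0, 1))
    return joint(g, e, d) / denominator

# The factorization defines a proper joint distribution.
total = sum(joint(g, e, d) for g, e, d in product((0, 1), repeat=3))
print(total)

# Once expression E is known, genotype G carries no further information about D.
for e in (0, 1):
    print(e, conditional_d(1, e, 0), conditional_d(1, e, 1))
```

In this chain the disease variable depends on genotype only through expression, so conditioning on E makes D independent of G. Learning such a structure from noisy genomic data is the harder problem that the paragraph above refers to.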
References

[1] Jun Zhang, Partha Niyogi, Mary Sara McPeek, Laplacian eigenfunctions learn population structure, PLoS ONE 2009, 4(12): e7928. doi:10.1371/journal.pone.0007928
[2] Jun Zhang, Chunhua Weng, Partha Niyogi, Graphical analysis of population structure on Rheumatoid arthritis data, BMC Proceedings 2009, 3(Suppl 7): S110
[3] Sternkopf V., Liebers-Helbig D., Ritz M., Zhang J., Helbig A. and Knijff P., Introgressive hybridization and non-concordant evolutionary history of mitochondrial and nuclear DNA in the herring gull complex, BMC Evolutionary Biology 2010, 10:348. doi:10.1186/1471-2148-10-348
[4] Jun Zhang, Ancestral informative marker selection and population structure visualization using sparse Laplacian eigenfunctions, PLoS ONE 2010, 5(11): e13734. doi:10.1371/journal.pone.0013734
[5] Jun Zhang, Adam Olshen, Evaluating statistical approaches for rare variants association studies using resequencing data, under review, 2010
[6] Jun Zhang, Kenji Suzuki, Improved massive training ANN using principal components for computer aided detection of polyps in CT colonography, IEEE Transactions on Medical Imaging 2010, 29: 1907-1917
[7] Jun Zhang, Adam Olshen, Cancer subtype classification via robust Principal Component using Copy Number Aberration data, in progress, 2010