Research Statement on Statistics Jun Zhang


Email: junzhang@galton.uchicago.edu

My interests in statistics center on machine learning and statistical genetics. My recent work focuses on detecting and interpreting patterns in high-dimensional, correlated, noisy data with complex structure, particularly in genomics and medical imaging. Specific projects include the following.

Population Structure Visualization

Correctly detecting population structure (PS) is critical for inferring human migration history and understanding evolution. The issue is made urgent by the confounding that population structure introduces into the disease association studies now being planned at a rapid pace, namely false discoveries caused by systematic allele frequency differences among subpopulations. The prevailing method for analyzing PS uses the top principal components (PCs) of the covariance matrix of subjects to summarize global genetic variation across space. In [1, 2], working from the viewpoint of manifold learning, I propose using Laplacian eigenfunctions instead of PCs to infer PS. The idea is to construct an adjacency graph in which each node represents one subject and is connected only to its close neighbors, since the relationship between weakly correlated subjects usually carries little meaningful information. One can then study the geometry of this intrinsic dependence graph. In particular, the Laplacian eigenfunctions associated with the graph are generalized harmonic functions that encode its geometric information. Compared with PCA, our method is less sensitive to noise and more robust to outliers. We expect the method, LAPSTRUCT, to become a useful tool for detecting and correcting for population structure in disease association studies. In collaborative work [3], the approach was applied successfully to speciation in gulls worldwide using AFLP markers, and the results are fully consistent with other evolutionary evidence.
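The graph construction described above can be sketched in a few lines. This is a minimal illustration and not the LAPSTRUCT implementation: it assumes a symmetric k-nearest-neighbour graph with 0/1 edge weights built from Euclidean distances between genotype vectors, and the unnormalized graph Laplacian. The function name `laplacian_eigenfunctions`, the choice k = 5, and the simulated genotypes are all hypothetical.

```python
import numpy as np

def laplacian_eigenfunctions(X, k=5, n_components=2):
    """Leading non-trivial eigenvectors of a kNN graph Laplacian.

    X is an (n_subjects, n_markers) array. Hypothetical helper for illustration.
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances between subjects.
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)  # a subject is not its own neighbour
    # Connect each subject to its k closest neighbours, then symmetrize.
    W = np.zeros((n, n))
    nearest = np.argsort(d2, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    W[rows, nearest.ravel()] = 1.0
    W = np.maximum(W, W.T)
    # Unnormalized graph Laplacian L = D - W.
    L = np.diag(W.sum(axis=1)) - W
    eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    # Skip the first eigenvector (constant, eigenvalue ~ 0, if the graph is connected).
    return eigvals[1:n_components + 1], eigvecs[:, 1:n_components + 1]

rng = np.random.default_rng(0)
# Simulated genotypes (assumption): two subpopulations with shifted allele frequencies.
freq_a = rng.uniform(0.1, 0.9, size=200)
freq_b = np.clip(freq_a + rng.normal(0, 0.2, size=200), 0.05, 0.95)
pop_a = rng.binomial(2, freq_a, size=(30, 200))
pop_b = rng.binomial(2, freq_b, size=(30, 200))
X = np.vstack([pop_a, pop_b]).astype(float)

vals, vecs = laplacian_eigenfunctions(X, k=5, n_components=2)
```

On real data one would plot the columns of `vecs` against each other, much as one plots the first PC against the second. Because each subject is linked only to its nearest neighbours, a single outlying subject affects only a few edges of the graph.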
Ancestral Informative Marker Selection via Sparse Laplacian Eigenfunctions

Identifying a small panel of markers that are informative about population structure can reduce genotyping cost and is useful in various applications, such as ancestry inference in association mapping, forensics, and evolutionary theory in population genetics. Traditional methods for ascertaining ancestral informative markers usually require prior knowledge of individual ancestry and have difficulty with admixed populations. In [4], I propose a novel approach based on our recent result on summarizing population structure by graph Laplacian eigenfunctions. The approach also takes advantage of the a priori sparseness of informative markers in the genome. Through simulation of a ring population and analysis of the real global population sample HGDP (650K SNPs genotyped in 940 unrelated individuals), we validate that the proposed algorithm selects the most informative markers, a small fraction of which efficiently recovers a similar underlying population structure. Using a standard Support Vector Machine (SVM) to predict individuals' continental memberships in the HGDP dataset of seven continents, we demonstrate that the SNPs selected by our method are more informative and less redundant than those selected by traditional methods.

Association of Rare Variants with Phenotype Using Next-Generation Sequencing Data

Identifying associated rare variants is important for understanding the etiology of complex diseases. Studies have shown that a group of rare variants may explain a proportion of the genetic basis of diseases. In [5], we propose a novel method, motivated by Fisher's combined significance method, that integrates the significance of each rare mutation and tests for association with disease at the level of the functional unit. Our approach is comparable to other prevailing approaches in the literature, such as weighted and combined methods. We illustrate this using the 1000 Genomes sequencing data. The method is computationally fast and is suitable for disease studies at the coming genome sequencing scale.

Diagnosis of Medical Images

In computerized diagnosis of medical images, such as CT colonography and digital mammography for breast cancer, it is critical to have an efficient algorithm for distinguishing malignant lesions from benign ones. One approach is to select a massive number of subregions as features from gold-standard samples and train on them according to their approximate likelihood of being malignant. In [6], we reduced the dimension of the features using manifold-based techniques and trained a multi-layer artificial neural network (ANN) on the new features, which reduced the training time significantly while statistical significance was maintained at the same sensitivity level.
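For the rare-variant work described earlier in this section, the statement says only that the test is motivated by Fisher's combined significance method and does not give the statistic itself. As a point of reference, the sketch below implements the classical Fisher combination, which assumes independent per-variant p-values (an assumption that linkage disequilibrium can violate). The function names and the example p-values are hypothetical, and this is not the method of [5].

```python
import math

def chi2_sf_even_df(x, df):
    """Survival function of a chi-square variable with even df (closed form)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    # P(X > x) = exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

def fisher_combined_pvalue(p_values):
    """Combine independent p-values: -2 * sum(log p) ~ chi-square with 2m df."""
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    return statistic, chi2_sf_even_df(statistic, 2 * len(p_values))

# Hypothetical per-variant p-values within one functional unit (e.g. a gene).
variant_p = [0.04, 0.20, 0.11, 0.50]
stat, p_unit = fisher_combined_pvalue(variant_p)
print(f"Fisher statistic = {stat:.3f}, unit-level p-value = {p_unit:.4f}")
```

The closed-form survival function avoids any dependency beyond the standard library, which works here because Fisher's statistic always has an even number of degrees of freedom.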
Planned Research in the Near Future

Cancer Subtype Classification via Robust Principal Components Using Copy Number Aberration Data

In [7], we are developing a robust principal components based approach to clustering tumor subtypes using their DNA copy number profiles. We demonstrate the procedure on a breast cancer dataset for which gene expression profiles are also available to verify the conclusions.

Random Matrix Theory for Dependent Data and Dependence Measures for Nonlinear Time Series

Classical random matrix theory (RMT) gives the limiting distribution of the suitably normalized largest eigenvalue of the sample covariance matrix for i.i.d. Gaussian samples, namely the Tracy-Widom law. Real data, however, are typically dependent. A natural question arises: is there a similar asymptotic distribution for dependent samples? The general dependent case seems out of reach. Instead, I am investigating the question under weak forms of dependence such as m-dependence, where the samples may be observations from certain simple stationary time series.

Estimation of Integrated Volatility Using Lévy Processes

My interest lies in using high-frequency data to estimate the integrated volatility over a given time period. Let $\{S_t\}$ denote the price process of a security and suppose the log-return process $\{X_t = \log S_t\}$ follows an Itô process $dX_t = \mu_t\,dt + \sigma_t\,dL_t$, where $L_t$ is a stable Lévy process with index $\alpha$. We also incorporate observation error into the estimation procedure through $Y_{t_i} = X_{t_i} + \epsilon_{t_i}$, where $X_t$ is the latent true return process and the $\epsilon_{t_i}$ are independent noise terms. Because of market microstructure, sparse subsampling is known to reduce the variance of quadratic-variation-based estimators when the return process is driven by Brownian motion, that is, $\alpha = 2$. I investigate similar estimators in the general case $0 < \alpha < 2$ and demonstrate the advantages of the additional variability of Lévy processes, which allow jumps. This is joint work with Wei-biao Wu.

Statistical Manifold Learning

Manifold methods have become increasingly important and popular in machine learning and have seen numerous recent applications in data analysis, including dimensionality reduction, visualization, clustering, and classification. The central modeling assumption in all of these methods is that the data reside on or near a low-dimensional submanifold of a higher-dimensional space. However, one does not have access to the underlying manifold and instead approximates it from a point cloud, usually by constructing an adjacency graph.
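To make the adjacency-graph approximation concrete, the sketch below samples a unit circle embedded in three dimensions, links each point to its two nearest neighbours, and compares the shortest-path distance along the graph with the true arc length. The choice of manifold, the value k = 2, and the use of shortest-path distances (as in Isomap-style methods) are illustrative assumptions and do not come from the work described here.

```python
import heapq
import math

def knn_graph(points, k):
    """Symmetric k-nearest-neighbour graph with Euclidean edge lengths."""
    n = len(points)
    adj = {i: {} for i in range(n)}
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            adj[i][j] = d
            adj[j][i] = d
    return adj

def shortest_paths(adj, source):
    """Dijkstra's algorithm: graph distance from source to every node."""
    dist = {node: math.inf for node in adj}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

n = 200
angles = [2.0 * math.pi * i / n for i in range(n)]
# Unit circle (a one-dimensional manifold) embedded in R^3.
points = [(math.cos(a), math.sin(a), 0.0) for a in angles]

graph_dist = shortest_paths(knn_graph(points, k=2), source=0)
antipode = n // 2
chord = math.dist(points[0], points[antipode])  # straight line through ambient space
print(f"graph distance {graph_dist[antipode]:.5f}, "
      f"ambient distance {chord:.5f}, arc length {math.pi:.5f}")
```

The ambient Euclidean distance between antipodal points is 2, whereas the distance measured along the graph is close to the arc length $\pi$, so the graph reflects the intrinsic geometry rather than the embedding.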
The underlying intuition has always been that, since the graph is a proxy for the manifold, inference based on the structure of the graph corresponds to the desired inference based on the geometric structure of the manifold. We are exploring theoretical results to justify this intuition. To be precise, Niyogi earlier introduced a framework based on the Laplace-Beltrami operator on a manifold to motivate the use of the graph Laplacian associated with point-cloud data, namely Laplacian Eigenmaps. Assuming $M$ is a compact Riemannian submanifold of $\mathbb{R}^n$, the operator $\Delta_M$ is defined as $\Delta_M f = -\operatorname{div}(\nabla f)$, where $f \in C^2(M)$. The eigenfunctions of the Laplacian form a basis for $L^2(M)$ and play a central role in a variety of algorithms for data analysis. If the manifold is equipped with a measure $v$ (given by $dv(x) = P(x)\,d\mu(x)$ for some density function $P(x)$, with $d\mu$ the canonical measure corresponding to the volume form), then the weighted Laplacian is defined as

\[ \Delta_{M,v} f(x) = -\frac{1}{P(x)} \operatorname{div}\bigl(P(x)\,\nabla_M f\bigr). \]

Given data points $\{x_1, \ldots, x_n\}$ sampled i.i.d. from an arbitrary distribution $P$ on $M$, we construct a weighted graph associated with the point cloud using a Gaussian kernel. We define the point-cloud Laplace operator by

\[ L_n^t f(x) = f(x)\,\frac{1}{n} \sum_j e^{-\frac{\|x - x_j\|^2}{4t}} - \frac{1}{n} \sum_j f(x_j)\, e^{-\frac{\|x - x_j\|^2}{4t}}. \]

We justify the following. Let $t_n = n^{-\frac{1}{k+2+\alpha}}$, where $k$ is the dimension of $M$ and $\alpha > 0$, and let $f \in C^\infty(M)$. Then the following limit holds:

\[ \lim_{n \to \infty} \frac{1}{t_n (4\pi t_n)^{k/2}}\, L_n^{t_n} f(x) = \frac{1}{\operatorname{vol}(M)}\, \Delta_{M,v} f(x). \]

Gene Regulatory Networks and Graphical Models

Identifying variations in DNA that increase susceptibility to disease is one of the primary aims of genetic studies that use a forward genetics approach, such as linkage and association testing. However, such studies provide limited functional information on how genes lead to disease. An alternative is to identify gene networks that are perturbed by susceptibility loci and that in turn lead to disease. Bayesian networks have recently been employed as a tool for inferring the interactions between genes. A Bayesian network is a graphical model of a joint multivariate probability distribution that captures conditional independence properties between variables. Given various genomic data such as genotypes, expression profiles, copy number aberrations, and sequencing data, I am interested in developing graphical models that can better learn transcriptional regulatory networks and infer causal relations from noisy data. A gene regulatory network is also a dynamic network. With additional time-course gene expression data, it will be very valuable to combine tools from time series analysis with the network framework. Another closely related direction is to develop graphical-model-based tools that incorporate known biological pathway knowledge into association studies.
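As a toy illustration of how a Bayesian network encodes conditional independence, consider a three-node chain in which a SNP affects the expression of a gene, which in turn affects disease status. The structure, the binary coding, and every probability below are made-up assumptions for illustration, not estimates from data.

```python
from itertools import product

# Hypothetical conditional probability tables for the chain
# SNP -> expression -> disease (all variables binary; numbers are made up).
p_snp = {0: 0.7, 1: 0.3}
p_expr_given_snp = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}
p_dis_given_expr = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}

def joint(s, e, d):
    """Joint probability from the network factorization P(s) P(e|s) P(d|e)."""
    return p_snp[s] * p_expr_given_snp[s][e] * p_dis_given_expr[e][d]

def prob(**fixed):
    """Probability of the event given by keyword arguments s, e and/or d."""
    total = 0.0
    for s, e, d in product((0, 1), repeat=3):
        values = {"s": s, "e": e, "d": d}
        if all(values[name] == val for name, val in fixed.items()):
            total += joint(s, e, d)
    return total

# Marginally, disease risk depends on the SNP genotype.
risk_given_snp = {s: prob(s=s, d=1) / prob(s=s) for s in (0, 1)}

# Conditional on expression, the SNP carries no further information.
risk_given_snp_expr = {
    (s, e): prob(s=s, e=e, d=1) / prob(s=s, e=e)
    for s, e in product((0, 1), repeat=2)
}

print(risk_given_snp)
print(risk_given_snp_expr)
```

In this chain the SNP is associated with disease marginally, yet the association vanishes once expression is conditioned on. This is the kind of functional information that a marker-by-marker association test alone does not reveal.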
References

[1] Jun Zhang, Partha Niyogi, Mary Sara McPeek, Laplacian eigenfunctions learn population structure, PLoS ONE 2009, 4(12): e7928. doi:10.1371/journal.pone.0007928

[2] Jun Zhang, Chunhua Weng, Partha Niyogi, Graphical analysis of population structure on Rheumatoid arthritis data, BMC Proceedings 2009, 3(Suppl 7): S110

[3] Sternkopf V., Liebers-Helbig D., Ritz M., Zhang J., Helbig A. and Knijff P., Introgressive hybridization and non-concordant evolutionary history of mitochondrial and nuclear DNA in the herring gull complex, BMC Evolutionary Biology 2010, 10:348. doi:10.1186/1471-2148-10-348

[4] Jun Zhang, Ancestral informative marker selection and population structure visualization using sparse Laplacian eigenfunctions, PLoS ONE 2010, 5(11): e13734. doi:10.1371/journal.pone.0013734

[5] Jun Zhang, Adam Olshen, Evaluating statistical approaches for rare variants association studies using resequencing data, under review, 2010

[6] Jun Zhang, Kenji Suzuki, Improved massive training ANN using principal components for computer aided detection of polyps in CT colonography, IEEE Transactions on Medical Imaging 2010, 29: 1907-1917

[7] Jun Zhang, Adam Olshen, Cancer subtype classification via robust Principal Component using Copy Number Aberration data, in progress, 2010