Learning Your Identity and Disease from Research Papers: Information Leaks in Genome-Wide Association Study

Similar documents
Introduction to Linkage Disequilibrium

Friday Harbor From Genetics to GWAS (Genome-wide Association Study) Sept David Fardo

1 Springer. Nan M. Laird Christoph Lange. The Fundamentals of Modern Statistical Genetics

p(d g A,g B )p(g B ), g B

Private Computation with Genomic Data for Genome-Wide Association and Linkage Studies

Genotype Imputation. Biostatistics 666

How to analyze many contingency tables simultaneously?

Proportional Variance Explained by QLT and Statistical Power. Proportional Variance Explained by QTL and Statistical Power

Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies

Differential Privacy with Bounded Priors: Reconciling Utility and Privacy in Genome-Wide Association Studies

(Genome-wide) association analysis

Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS

Bayesian Inference of Interactions and Associations

Genotype Imputation. Class Discussion for January 19, 2016

Binomial Mixture Model-based Association Tests under Genetic Heterogeneity

CSci 8980: Advanced Topics in Graphical Models Analysis of Genetic Variation

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5

Chapter 6 Linkage Disequilibrium & Gene Mapping (Recombination)

Lecture 22: Signatures of Selection and Introduction to Linkage Disequilibrium. November 12, 2012

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

USING LINEAR PREDICTORS TO IMPUTE ALLELE FREQUENCIES FROM SUMMARY OR POOLED GENOTYPE DATA. By Xiaoquan Wen and Matthew Stephens University of Chicago

Learning ancestral genetic processes using nonparametric Bayesian models

Supplementary Figures

Model-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate

On the limiting distribution of the likelihood ratio test in nucleotide mapping of complex disease

1.5.1 ESTIMATION OF HAPLOTYPE FREQUENCIES:

CONTENTS. P A R T I Genomes 1. P A R T II Gene Transcription and Regulation 109

Differen'al Privacy with Bounded Priors: Reconciling U+lity and Privacy in Genome- Wide Associa+on Studies

Computational Approaches to Statistical Genetics

COMBI - Combining high-dimensional classification and multiple hypotheses testing for the analysis of big data in genetics

An Integrated Approach for the Assessment of Chromosomal Abnormalities

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017

An Integrated Approach for the Assessment of Chromosomal Abnormalities

Nature Methods: doi: /nmeth Supplementary Figure 1

2. Map genetic distance between markers

Calculation of IBD probabilities

Robust Detection and Identification of Sparse Segments in Ultra-High Dimensional Data Analysis

Nature Genetics: doi: /ng Supplementary Figure 1. Number of cases and proxy cases required to detect association at designs.

BTRY 7210: Topics in Quantitative Genomics and Genetics

Régression en grande dimension et épistasie par blocs pour les études d association

Heritability estimation in modern genetics and connections to some new results for quadratic forms in statistics

Using a CUDA-Accelerated PGAS Model on a GPU Cluster for Bioinformatics

Backward Genotype-Trait Association. in Case-Control Designs

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Supporting Information

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Linear Regression (1/1/17)

Statistical Methods in Mapping Complex Diseases

ACGTTTGACTGAGGAGTTTACGGGAGCAAAGCGGCGTCATTGCTATTCGTATCTGTTTAG Human Population Genomics

Fei Lu. Post doctoral Associate Cornell University

Accounting for read depth in the analysis of genotyping-by-sequencing data

SAT in Bioinformatics: Making the Case with Haplotype Inference

NIH Public Access Author Manuscript Stat Sin. Author manuscript; available in PMC 2013 August 15.

HERITABILITY ESTIMATION USING A REGULARIZED REGRESSION APPROACH (HERRA)

Phasing via the Expectation Maximization (EM) Algorithm

Humans have two copies of each chromosome. Inherited from mother and father. Genotyping technologies do not maintain the phase

Lecture WS Evolutionary Genetics Part I 1

Linkage and Linkage Disequilibrium

Frequency Spectra and Inference in Population Genetics

Lab 12. Linkage Disequilibrium. November 28, 2012

Learning gene regulatory networks Statistical methods for haplotype inference Part I

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

The Generalized Higher Criticism for Testing SNP-sets in Genetic Association Studies

Analysis of Y-STR Profiles in Mixed DNA using Next Generation Sequencing

Computational Systems Biology: Biology X

Complexity and Approximation of the Minimum Recombination Haplotype Configuration Problem

Cover Page. The handle holds various files of this Leiden University dissertation

Calculation of IBD probabilities

Optimal Methods for Using Posterior Probabilities in Association Testing

Feature Selection via Block-Regularized Regression

GWAS for Compound Heterozygous Traits: Phenotypic Distance and Integer Linear Programming Dan Gusfield, Rasmus Nielsen.

A novel fuzzy set based multifactor dimensionality reduction method for detecting gene-gene interaction

Enabling Accurate Analysis of Private Network Data

Auditing Information Leakage for Distance Metrics

EM algorithm. Rather than jumping into the details of the particular EM algorithm, we ll look at a simpler example to get the idea of how it works

BTRY 4830/6830: Quantitative Genomics and Genetics Fall 2014

SNP Association Studies with Case-Parent Trios

opulation genetics undamentals for SNP datasets

Population Genetics II (Selection + Haplotype analyses)

MBeacon: Privacy-Preserving Beacons for DNA Methylation Data

Association studies and regression

1. Understand the methods for analyzing population structure in genomes

Supplementary Information for: Detection and interpretation of shared genetic influences on 42 human traits

For 5% confidence χ 2 with 1 degree of freedom should exceed 3.841, so there is clear evidence for disequilibrium between S and M.

Efficient designs of gene environment interaction studies: implications of Hardy Weinberg equilibrium and gene environment independence

Inference From Genome-Wide Association Studies Using a Novel Markov Model

Powerful multi-locus tests for genetic association in the presence of gene-gene and gene-environment interactions

Lecture 1: Case-Control Association Testing. Summer Institute in Statistical Genetics 2015

Some models of genomic selection

New imputation strategies optimized for crop plants: FILLIN (Fast, Inbred Line Library ImputatioN) FSFHap (Full Sib Family Haplotype)

Genetic Drift in Human Evolution

Theoretical and computational aspects of association tests: application in case-control genome-wide association studies.

An Efficient and Accurate Graph-Based Approach to Detect Population Substructure

Statistical Power of Model Selection Strategies for Genome-Wide Association Studies

The supplementary document of LLR: A latent low-rank approach to colocalizing genetic risk variants in multiple GWAS

1 Preliminary Variance component test in GLM Mediation Analysis... 3

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS

Case-Control Association Testing. Case-Control Association Testing

A differential equation model for functional mapping of a virus-cell dynamic system

A mixed model based QTL / AM analysis of interactions (G by G, G by E, G by treatment) for plant breeding

Transcription:

Learning Your Identity and Disease from Research Papers: Information Leaks in Genome-Wide Association Study Rui Wang, Yong Li, XiaoFeng Wang, Haixu Tang and Xiaoyong Zhou Indiana University at Bloomington

Genomic Revolutions Low-cost genotyping Revolutionary applications

Genome-Wide Association Study Case Group Control Group Single Nucleotide Polymorphism (SNP)

Identification Risk Consequence of identifications Participant protection De-identification Aggregation Is this sufficient?

Attack on Aggregated Data Single-allele frequencies Major: 0; Minor: 1 Homer s attack NIH s Reactions

The Rest of The Iceberg Other genome data Test statistics Linkage Disequilibrium (LD) Haplotype sequences Other sources Publications

Our Scary Findings ID from GWAS publications Test statistics LD statistics Allele Frequencies Statistical Identification Pair-wise allele frequencies SNP Sequences Work on real genome data Conclusion: Urgent needs to thoroughly study the problem

Why Doing This? Facilitate Dissemination of Genome Data SAFELY A Lesson From the Internet: Build Protection Into the Core!

Terms Alleles Single (0 1) Pair-wise (00, 01, 10, 11) Genotype Combinations of two sets of alleles Haplotype SNP Sequence (phased genotype) Locus Surrounding region of a SNP site

GWAS: Backgrounds GWAS Study Quality Control Info Leaks p values: leak joint allele frequencies of case & control Association Detection p values: leak case frequencies & control frequencies LD in Regions of Association r 2 : reveal LD of SNP sequences Replication Disclose cohorts with same frequency distributions

Homer s Attack Reference Group (Pop) Case Group (M) Pop j : 0.3 Pop j+1 : 0.6 Pop j+2 : 0.3 Y j : 1 Y j+1 : 0 Y j+2 : 1 Y i Pop i Y i M i M j : 0.8 M j+1 : 0.2 M j+2 : 0.6 H 0 Not in M D(Y i ) = Y i Pop i - Y i M i ΣD

What we can do Reverse engineer test statistics To find allele frequencies LD-based statistical identification Recover SNP sequences

Allele Frequency (Single) SNP 1 SNP2 SNP3 r 2 (1,3) C r 2 0* (1,2) r 2 (2,3) p 2 p 3

Allele Frequencies (Pair-wise) 2 2 (C00N *0C 2 (C N C C 0* ) 00 *0 0* ) L r< = < U C C C C C*1 C C C C 0* 1* *0 *1 = C = C = C = C 0* 0* 1* 00 10 00 01 + C + C 1* *0 + C + C 01 11 10 11 *0 *1 (1) (2) (3) (4) (5) Catch: C 00 not unique Integer constraint Inaccurate r-squares Signs

Homer-Style Attack Based On LD? Why? Single AF: n LD: n(n-1)/2 But how? Validity of the test statistic D(Y i ) = Y i Pop i - Y i M i r 2 2 (C00C11 C10C01) = C 0* C 1* C *0 C *1

Our Statistical Attack We have to use signed r Distribution of T r? Markov model Reference? T T r = ( Y = 00 + Y T 1 i j N 11 ) ( r R + 1) / 2 ( Y 00 + Y 11 ) ( r C + 1) / 2 = ( r C r R )( Y 00 + Y 11 Y 01 Y 10 )

Recover SNP Sequences Contingency table problem Studied for decades Very difficult Divide-and-Conquer 1. Construct each haplotype block 2. Connect different blocks

Simple Defense Low-precision statistics Correlation among SNPs Thresholds How to determine them? Noises Consistency check Maximum-likelihood approximation

Evaluations Data: the HapMap project Locus: FGFR2 174 SNPs Used in a real GWAS study Population Africa backgrounds 200: half cases and half controls

Allele Frequencies and Signs

Statistical Powers 20 times more powerful than Homer s test (T p )

Recover Haplotypes Linear equation solving: rref Integer Programming: bintprog 100 individuals, 10 blocks, 174 SNPs System: 2.80GHz Core 2 Duo, 3GB memory Fully restored within 12 hours

Discussion Genotypes vs. Haplotypes Defense Differential privacy

Conclusion New attacks and new understanding Many open research problems

Contacts Dr. XiaoFeng Wang 812-856-1862 Web: www.informatics.indiana.e du/xw7 System Security Lab: sysseclab.informatics.india na.edu Dr. Haixu Tang 812-856-1859 Web: www.informatics.indiana.e du/hatang

References Good: from the same population Bad: from different populations good reference average reference

More In-depth Studies Larger populations: Low-precision statistics (200 cases, 200 references)