1 Review
We have covered so far:
  Single variant association analysis and effect size estimation
  GxE interaction and higher-order (>2) interaction
  Measurement error in dietary variables (nutritional epidemiology)
Today's lecture: set-based association

2 Set-based association analysis
Subjects $i = 1, \ldots, n$; $m$ variants in a given region.
Genotype: $G_i = (G_{i1}, G_{i2}, \ldots, G_{im})^\top$, with $G_{ij} \in \{0, 1, 2\}$.
Covariates $X_i$: intercept, age, sex, principal components for population structure.
Model:
$$g\{E(y_i)\} = X_i^\top \alpha + \sum_{j=1}^m G_{ij}\beta_j,$$
where $g(\cdot)$ is a link function.
No association for the set means $\beta = (\beta_1, \ldots, \beta_m)^\top = 0$.

3 Why?
In gene expression studies, a list of differentially expressed genes often fails to give investigators mechanistic insight into the underlying biology.
Pathway analysis extracts meaning by grouping genes into small sets of related genes.
Function databases are curated to help with this task, e.g., the biological processes in which genes are involved or in which they interact with each other.
Analyzing high-throughput molecular measurements at the functional or pathway level is very appealing:
  It reduces complexity.
  It improves explanatory power over a simple list of differentially expressed genes.

4 Why?
Variants in a defined set (e.g., genes, pathways) may act concordantly on phenotypes.
Combining these variants aggregates the signals and, as a result, may improve power.
Power is particularly an issue when variants are rare.
The challenge is that not all variants in a set are associated with the phenotype, and those that are associated may have either positive or negative effects.

5 Outline
Burden test
Variance component test
Mixed effects score test

6 Burden Test
If $m$ is large, the multivariate test of $\beta = 0$ is not very powerful.
Population genetics evolutionary theory suggests that most rare missense alleles are deleterious, and the effect is therefore generally considered one-sided (Kryukov et al., 2007).
Collapsing: suppose $\beta_1 = \cdots = \beta_m = \eta$. Then
$$g\{E(y_i)\} = X_i^\top \alpha + B_i \eta, \qquad B_i = \sum_{j=1}^m G_{ij},$$
where $B_i$ is the genetic burden/score.
With weights (adaptive burden test): $B_i = \sum_{j=1}^m w_j G_{ij}$.
Test $H_0: \eta = 0$ (d.f. = 1).
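The collapsing step on the burden test slide can be sketched numerically. The block below is a minimal illustration, assuming a linear model with identity link and simulated data; the function name `burden_score_test` and all simulation settings are hypothetical and not part of the lecture.

```python
import math
import numpy as np

def burden_score_test(y, X, G, w=None):
    """Score test of H0: eta = 0 for the burden B = G @ w (linear model sketch)."""
    n, m = G.shape
    w = np.ones(m) if w is None else np.asarray(w, dtype=float)
    B = G @ w                                          # B_i = sum_j w_j G_ij
    alpha_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # null fit (eta = 0)
    resid = y - X @ alpha_hat
    sigma2 = resid @ resid / (n - X.shape[1])
    # Project X out of B so the variance reflects that alpha was estimated.
    B_adj = B - X @ np.linalg.lstsq(X, B, rcond=None)[0]
    z = (B_adj @ resid) / math.sqrt(sigma2 * (B_adj @ B_adj))
    p = math.erfc(abs(z) / math.sqrt(2.0))             # two-sided normal p-value
    return B, z, p

# Simulated example (assumed settings): every variant has the same effect.
rng = np.random.default_rng(0)
n, m = 200, 5
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one covariate
G = rng.binomial(2, 0.1, size=(n, m)).astype(float)    # additive 0/1/2 coding
y = X @ np.array([1.0, 0.5]) + 2.0 * G.sum(axis=1) + rng.normal(size=n)
B, z, p = burden_score_test(y, X, G)
```

Because all simulated effects share one sign and size, the single degree-of-freedom test concentrates the signal; passing a weight vector `w` gives the weighted burden score.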

7 Weight
Threshold-based method:
$$w_j(t) = \begin{cases} 1 & \text{if } \mathrm{MAF}_j \le t \\ 0 & \text{if } \mathrm{MAF}_j > t \end{cases}$$
Burden score $B_i(t) = \sum_{j=1}^m w_j(t) G_{ij}$; the corresponding Z-statistic is denoted by $Z(t)$.
Variable threshold (VT) test: $Z_{\max} = \max_t Z(t)$.
The p-value can be calculated by permutation (Price et al. 2010) or by numerical integration using a normal approximation:
$$P(Z_{\max} \ge z) = 1 - P(Z_{\max} < z) = 1 - P(Z(t_1) < z, \ldots, Z(t_b) < z),$$
where $\{Z(t_1), \ldots, Z(t_b)\}$ follows a multivariate normal distribution $\mathrm{MVN}(0, \Sigma)$.
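A small sketch of the variable threshold idea, using the permutation route for the p-value. It assumes an intercept-only linear model (so permuting $y$ is valid) and a one-sided $Z_{\max}$ as written on the slide; the helper names and the simulated data are hypothetical.

```python
import numpy as np

def threshold_weights(maf, t):
    """w_j(t) = 1 if MAF_j <= t, else 0."""
    return (np.asarray(maf) <= t).astype(float)

def burden_z(y, G, w):
    """Z-statistic for the burden B = G @ w in an intercept-only linear model."""
    B = G @ w
    Bc = B - B.mean()
    r = y - y.mean()
    denom = np.sqrt(r.var(ddof=1) * (Bc @ Bc))
    return 0.0 if denom == 0 else float(Bc @ r / denom)

def vt_test(y, G, thresholds, n_perm=500, seed=0):
    """Z_max over MAF thresholds with a permutation p-value."""
    rng = np.random.default_rng(seed)
    maf = G.mean(axis=0) / 2.0                  # sample minor allele frequencies

    def z_max(yy):
        return max(burden_z(yy, G, threshold_weights(maf, t)) for t in thresholds)

    observed = z_max(y)
    perms = np.array([z_max(rng.permutation(y)) for _ in range(n_perm)])
    p_value = (1 + np.sum(perms >= observed)) / (n_perm + 1)
    return observed, p_value

rng = np.random.default_rng(2)
G = rng.binomial(2, rng.uniform(0.005, 0.05, size=8), size=(300, 8)).astype(float)
y = rng.normal(size=300)                        # simulated under the null
thresholds = [0.01, 0.02, 0.05]
z_obs, p_vt = vt_test(y, G, thresholds, n_perm=200)
```

The permutation loop recomputes the maximum over thresholds each time, which is what accounts for the selection of the best threshold.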

8 Weight
Variant effects can be positive or negative, and their strength can differ too.
Fit the marginal model for each variant:
$$g\{E(y_i)\} = X_i^\top \alpha + G_{ij}\gamma_j$$

9 Weight
Adaptive sum test (Han & Pan, Hum Hered 2010):
$$w_j = \begin{cases} -1 & \text{if } \hat\gamma_j < 0 \text{ and p-value} < \alpha_0 \\ 1 & \text{otherwise} \end{cases}$$
If $\alpha_0 = 1$, the weight is the sign of $\hat\gamma_j$, but the corresponding weighted burden test has low power because the null distribution has heavy tails.
$\alpha_0$ is chosen such that the sign is changed for a negative $\hat\gamma_j$ only when $H_0$ likely does not hold.
The authors suggest $\alpha_0 = 0.1$, but the choice is data dependent.
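The sign-flipping rule of the adaptive sum test is easy to state in code. This is a minimal sketch that takes the marginal estimates and p-values as given inputs; the function name `asum_weights` and the example numbers are made up for illustration.

```python
import numpy as np

def asum_weights(gamma_hat, p_values, alpha0=0.1):
    """w_j = -1 if gamma_hat_j < 0 and its marginal p-value < alpha0, else 1."""
    gamma_hat = np.asarray(gamma_hat, dtype=float)
    p_values = np.asarray(p_values, dtype=float)
    flip = (gamma_hat < 0) & (p_values < alpha0)
    return np.where(flip, -1.0, 1.0)

# Hypothetical marginal estimates and p-values for four variants.
gamma_hat = [0.8, -0.9, -0.1, 0.2]
p_values = [0.03, 0.04, 0.60, 0.50]
w = asum_weights(gamma_hat, p_values)                     # only variant 2 is flipped
w_sign = asum_weights(gamma_hat, p_values, alpha0=1.0)    # reduces to sign(gamma_hat)
```

With `alpha0=1.0` every negative estimate is flipped, which is the heavy-tailed case the slide warns about.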

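10 Weight
Estimated regression coefficient (EREC) test (Lin & Tang, Am J Hum Genet 2011)
Weights: $w_j = \hat\gamma_j + c$, for $c \neq 0$.
Score statistic:
$$T_{\mathrm{EREC}} = \frac{1}{\sqrt{n}} \sum_{i=1}^n \Big\{ \Big( \sum_{j=1}^m (\hat\gamma_j + c) G_{ij} \Big) \big(y_i - \mu_i(\hat\alpha)\big) \Big\} \to N(0, \Sigma)$$
$\mu_i(\hat\alpha)$ is estimated under the null of no association.
If $c = 0$, $T_{\mathrm{EREC}} = \frac{1}{\sqrt{n}} \sum_{i=1}^n \{ ( \sum_{j=1}^m \hat\gamma_j G_{ij} ) (y_i - \mu_i(\hat\alpha)) \}$ is not asymptotically normal.
$c = 1$ for binary traits, $c = 2$ for standardized quantitative traits.
Compute p-values using permutation.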

11 Burden Tests
Burden tests lose a significant amount of power if variants have different directions of association or if a large number of variants are neutral.
Adaptive burden tests have robust power, but they rely on resampling to compute p-values.
Resampling is computationally intensive and not suitable for genome-wide discovery.

12 Variance Component Test
Model:
$$g\{E(y_i)\} = X_i^\top \alpha + \sum_{j=1}^m G_{ij}\beta_j$$
Burden tests are derived assuming $\beta_1 = \cdots = \beta_m$.
Variance component test: assume $\beta_j \sim F(0, \tau^2)$, where $F(\cdot)$ is an arbitrary distribution and the mean of the $\beta_j$'s is 0.
$$H_0: \beta_1 = \cdots = \beta_m = 0 \iff H_0: \tau^2 = 0$$

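13 Derivation of Variance Component Test
Suppose $g(\cdot)$ is linear and $Y \mid X, G$ is normal. That is,
$$y_i = X_i^\top \alpha + \sum_{j=1}^m G_{ij}\beta_j + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)$$
Suppose $\beta_j \sim \mathrm{Normal}(0, \tau^2)$, $j = 1, \ldots, m$.
Marginal model:
$$Y_{n \times 1} \sim \mathrm{MVN}\big(X_{n \times p}\,\alpha,\ \tau^2 G_{n \times m} G_{m \times n}^\top + \sigma^2 I\big)$$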

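14 Derivation of Variance Component Test
Log likelihood:
$$l = -\frac{1}{2}(Y - X\alpha)^\top (\tau^2 GG^\top + \sigma^2 I)^{-1} (Y - X\alpha) - \frac{1}{2}\log|\tau^2 GG^\top + \sigma^2 I| - \frac{n}{2}\log(2\pi)$$
Let $V(\tau^2) = \tau^2 GG^\top + \sigma^2 I$.
Score function:
$$\frac{\partial l}{\partial \tau^2} = \frac{1}{2}(Y - X\alpha)^\top V(\tau^2)^{-1} GG^\top V(\tau^2)^{-1} (Y - X\alpha) - \frac{1}{2}\mathrm{tr}\big(V(\tau^2)^{-1} GG^\top\big)$$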

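15 Derivation of Variance Component Test
Score test statistic:
$$Q = \frac{\partial l}{\partial \tau^2}\Big|_{\tau^2 = 0} = \frac{1}{2}(Y - X\alpha)^\top V^{-1} GG^\top V^{-1} (Y - X\alpha) - \frac{1}{2}\mathrm{tr}(V^{-1} GG^\top)$$
$$= \frac{1}{2}(Y - X\alpha)^\top M (Y - X\alpha) - \frac{1}{2}\mathrm{tr}(V^{1/2} M V^{1/2}),$$
where $M = V^{-1} GG^\top V^{-1}$ and $V = \sigma^2 I$.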

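16 Derivation of Variance Component Test
$$Q = \frac{1}{2}(Y - X\alpha)^\top M (Y - X\alpha) - \frac{1}{2}\mathrm{tr}(V^{1/2} M V^{1/2}) = \frac{1}{2}\tilde{Y}^\top (V^{1/2} M V^{1/2}) \tilde{Y} - \frac{1}{2}\mathrm{tr}(V^{1/2} M V^{1/2}),$$
where $\tilde{Y} = V^{-1/2}(Y - X\alpha) \sim N(0, I)$.
Let $\{\lambda_j, u_j, j = 1, \ldots, m\}$ be the nonzero eigenvalues and corresponding eigenvectors of $V^{1/2} M V^{1/2}$. Then
$$Q = \frac{1}{2}\sum_{j=1}^m \lambda_j \big((u_j^\top \tilde{Y})^2 - 1\big) = \frac{1}{2}\sum_{j=1}^m \lambda_j (Z_j^2 - 1), \qquad Z_j = u_j^\top \tilde{Y}.$$
$Q$ is not asymptotically normal.
Zhang and Lin (2003) show that
$$\tilde{Y}^\top (V^{1/2} M V^{1/2}) \tilde{Y} \sim \sum_{j=1}^m \lambda_j \chi^2_{1,j}.$$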
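The eigen-decomposition of the score statistic can be checked numerically. The sketch below assumes $V = \sigma^2 I$ with $\alpha$ and $\sigma^2$ treated as known, and uses simulated data; it computes $Q$ directly from the quadratic form and again from the eigenvalues and eigenvectors of $V^{1/2} M V^{1/2}$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, sigma2 = 50, 4, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
alpha = np.array([0.5, -1.0])                        # treated as known in this sketch
G = rng.binomial(2, 0.2, size=(n, m)).astype(float)
y = X @ alpha + rng.normal(scale=np.sqrt(sigma2), size=n)   # simulated under H0

resid = y - X @ alpha

# Direct form: Q = 1/2 r' M r - 1/2 tr(V^{-1} G G'), with V = sigma2 * I.
M = G @ G.T / sigma2**2                              # V^{-1} G G' V^{-1}
Q_direct = 0.5 * resid @ M @ resid - 0.5 * np.trace(G @ G.T) / sigma2

# Eigen form: Q = 1/2 sum_j lambda_j ((u_j' Y_tilde)^2 - 1).
A = G @ G.T / sigma2                                 # V^{1/2} M V^{1/2}
lam, U = np.linalg.eigh(A)                           # all n eigenpairs; at most m nonzero
y_tilde = resid / np.sqrt(sigma2)                    # V^{-1/2} (Y - X alpha)
Q_eigen = 0.5 * np.sum(lam * ((U.T @ y_tilde) ** 2 - 1.0))

n_nonzero = int(np.sum(lam > 1e-8))                  # rank of G G' is at most m
```

Summing over all $n$ eigenpairs gives the same value as summing over the nonzero ones, since zero eigenvalues contribute nothing.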

17 Variance Component Test
The exact probability associated with a mixture of $\chi^2$ distributions is difficult to calculate.
The Satterthwaite method approximates the distribution by a scaled $\chi^2$ distribution, $\kappa\chi^2_\nu$, where $\kappa$ and $\nu$ are calculated by matching the first and second moments of the two distributions.
To account for the estimation of $\alpha$, replace $V^{-1}$ by the projection matrix
$$P = V^{-1} - V^{-1} X (X^\top V^{-1} X)^{-1} X^\top V^{-1}.$$
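The moment matching can be written out explicitly. For $\sum_j \lambda_j \chi^2_{1,j}$ the mean is $\sum_j \lambda_j$ and the variance is $2\sum_j \lambda_j^2$, while $\kappa\chi^2_\nu$ has mean $\kappa\nu$ and variance $2\kappa^2\nu$. A minimal sketch, with made-up eigenvalues and a hypothetical function name:

```python
import numpy as np

def satterthwaite(lam):
    """Match the mean and variance of sum_j lam_j * chi2_1 to kappa * chi2_nu."""
    lam = np.asarray(lam, dtype=float)
    s1 = lam.sum()              # mean of the mixture
    s2 = np.sum(lam ** 2)       # variance of the mixture is 2 * s2
    kappa = s2 / s1
    nu = s1 ** 2 / s2
    return kappa, nu

lam = np.array([3.0, 2.0, 1.0, 0.5])     # illustrative eigenvalues
kappa, nu = satterthwaite(lam)
kappa_eq, nu_eq = satterthwaite(np.ones(5))   # equal eigenvalues: an exact chi2_5
```

Given an observed statistic `q`, the approximate p-value is the upper tail of $\chi^2_\nu$ at `q / kappa`; if SciPy is available this could be evaluated with `scipy.stats.chi2.sf(q / kappa, nu)`.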

18 General Form of Variance Component Test
Linear model:
$$y_i = X_i^\top \alpha + h(G_i) + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)$$
$h(\cdot)$ is a centered unknown smooth function in a function space $\mathcal{H}$ generated by a positive definite kernel function $K(\cdot,\cdot)$.
$K(\cdot,\cdot)$ implicitly specifies a unique function space spanned by a set of orthogonal basis functions $\{\phi_j(G), j = 1, \ldots, J\}$, and any $h(\cdot)$ can be represented by a linear combination of these basis functions:
$$h(G) = \sum_{j=1}^J \zeta_j \phi_j(G) \quad \text{(the primal representation)}$$

19 General Form
Equivalently, $h(\cdot)$ can also be represented using $K(\cdot,\cdot)$ as
$$h(G_i) = \sum_{j=1}^n \omega_j K(G_i, G_j) \quad \text{(the dual representation)}$$
For a multi-dimensional $G$, it is more convenient to specify $h(G)$ using the dual representation, because explicit basis functions might be complicated to specify and their number might be high.

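20 Estimation
Penalized likelihood function (Kimeldorf and Wahba, 1970):
$$l = -\frac{1}{2}\sum_{i=1}^n \Big\{ y_i - X_i^\top \alpha - \sum_{j=1}^n \omega_j K(G_i, G_j) \Big\}^2 - \frac{1}{2}\lambda\,\omega^\top K \omega,$$
where $\lambda$ is a tuning parameter that controls the tradeoff between goodness of fit and complexity of the model.
$$\hat\alpha = \{X^\top (I + \lambda^{-1}K)^{-1} X\}^{-1} X^\top (I + \lambda^{-1}K)^{-1} y$$
$$\hat\omega = \lambda^{-1}(I + \lambda^{-1}K)^{-1}(y - X\hat\alpha), \qquad \hat{h} = K\hat\omega$$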

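21 Connection with Linear Mixed Models
The same estimators can be re-written as the solution of
$$\begin{bmatrix} X^\top V^{-1} X & X^\top V^{-1} \\ V^{-1} X & V^{-1} + (\tau K)^{-1} \end{bmatrix} \begin{bmatrix} \hat\alpha \\ \hat{h} \end{bmatrix} = \begin{bmatrix} X^\top V^{-1} y \\ V^{-1} y \end{bmatrix},$$
where $\tau = \lambda^{-1}\sigma^2$ and $V = \sigma^2 I$.
The estimators $\hat\alpha$ and $\hat{h}$ are the best linear unbiased predictors under the linear mixed model
$$y = X\alpha + h + \varepsilon,$$
where $h$ is an $n \times 1$ vector of random effects with distribution $N(0, \tau K)$ and $\varepsilon \sim N(0, V)$.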
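The equivalence between the penalized-likelihood estimators and the mixed model equations can be verified numerically. The sketch below is illustrative only: the data are simulated, and a small ridge is added to the linear kernel (an assumption made here so that $K$ is invertible, as the term $(\tau K)^{-1}$ requires).

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, lam, sigma2 = 40, 5, 2.0, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])
G = rng.binomial(2, 0.3, size=(n, m)).astype(float)
K = G @ G.T + 0.1 * np.eye(n)        # assumption: ridge added so K is invertible
y = X @ np.array([1.0, 0.5]) + G @ rng.normal(scale=0.3, size=m) + rng.normal(size=n)

# Closed-form kernel machine estimators (penalized likelihood).
A = np.linalg.inv(np.eye(n) + K / lam)               # (I + K / lambda)^{-1}
alpha_hat = np.linalg.solve(X.T @ A @ X, X.T @ A @ y)
omega_hat = A @ (y - X @ alpha_hat) / lam
h_hat = K @ omega_hat

# Mixed model equations with tau = sigma2 / lambda and V = sigma2 * I.
tau = sigma2 / lam
Vinv = np.eye(n) / sigma2
lhs = np.block([[X.T @ Vinv @ X, X.T @ Vinv],
                [Vinv @ X, Vinv + np.linalg.inv(tau * K)]])
rhs = np.concatenate([X.T @ Vinv @ y, Vinv @ y])
solution = np.linalg.solve(lhs, rhs)
alpha_mme, h_mme = solution[:2], solution[2:]
```

Both routes give the same $\hat\alpha$ and $\hat h$, so the tuning parameter $\lambda$ can be read as the variance ratio $\sigma^2/\tau$.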

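22 General Form of Variance Component Test
Testing $H_0: h = 0$ is equivalent to testing the variance component $\tau$: $H_0: \tau = 0$ versus $H_1: \tau > 0$.
The REML log-likelihood under the linear mixed model, with marginal covariance $V(\tau) = \tau K + \sigma^2 I$, is
$$l = -\frac{1}{2}\log|V(\tau)| - \frac{1}{2}\log|X^\top V(\tau)^{-1} X| - \frac{1}{2}(y - X\hat\alpha)^\top V(\tau)^{-1} (y - X\hat\alpha).$$
The score statistic for $H_0: \tau = 0$ is
$$Q = (Y - X\hat\alpha)^\top K (Y - X\hat\alpha) - \mathrm{tr}(KP),$$
which follows a mixture of $\chi^2_1$ distributions.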

23 Kernel
The kernel function $K(\cdot,\cdot)$ measures similarity for pairs of subjects.
Linear kernel: $K(G_i, G_k) = \sum_{j=1}^m G_{ij}G_{kj}$
Notes on $K(\cdot,\cdot)$:
  It can incorporate high-dimensional features and different types of features (e.g., SNPs, expression, environmental factors).
  $K(\cdot,\cdot)$ yields a symmetric positive semidefinite matrix. Without this property:
    Eigenvalues are interpreted as the % of variation explained by the corresponding eigenvectors, and a negative eigenvalue, implying negative variance, is not sensible.
    There is no guarantee that optimization algorithms that work for positive semidefinite kernels will work when there are negative eigenvalues.
    The mathematical foundation moves from real numbers to complex numbers.

24 Some Kernels
Linear kernel: $K(G_i, G_k) = \sum_{j=1}^m G_{ij}G_{kj} = \langle G_i, G_k \rangle$
IBS kernel: $K(G_i, G_k) = \frac{1}{2m}\sum_{j=1}^m \mathrm{IBS}(G_{ij}, G_{kj})$, where IBS is identity-by-state
Polynomial kernel: $K(G_i, G_k) = (\langle G_i, G_k \rangle)^p$, $p > 0$
  Models higher-order interaction:
  $$(\langle G_i, G_k \rangle)^2 = \Big(\sum_{j=1}^m G_{ij}G_{kj}\Big)^2 = \sum_{j=1}^m \sum_{j'=1}^m (G_{ij}G_{ij'})(G_{kj}G_{kj'})$$
Gaussian kernel: $K(G_i, G_k) = \exp(-\|G_i - G_k\|^2/\sigma^2)$
Schaid DJ (2010). Genomic similarity and kernel methods I: advancements by building on mathematical and statistical foundations. Hum Hered 70:
Schaid DJ (2010). Genomic similarity and kernel methods II: methods for genomic information. Hum Hered 70:

25 Choice of Kernels
An advantage of the kernel method is its expressive power to capture domain knowledge in a general manner.
It is generally difficult to construct a good kernel for a specific problem.
Basic operations to create new kernels from existing kernels:
  multiplying by a positive scalar
  adding kernels
  multiplying kernels (element-wise)
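The kernels listed on the preceding slides, and the operations for building new kernels from them, can be sketched as follows. This is an illustration on simulated genotypes; the function names are hypothetical, and the IBS count uses the fact that with 0/1/2 coding the number of alleles shared identical-by-state at one variant is $2 - |G_{ij} - G_{kj}|$.

```python
import numpy as np

def linear_kernel(G):
    return G @ G.T

def ibs_kernel(G):
    # Alleles shared IBS at one variant for 0/1/2 coding: 2 - |G_ij - G_kj|.
    diff = np.abs(G[:, None, :] - G[None, :, :])
    return (2.0 - diff).sum(axis=2) / (2.0 * G.shape[1])

def polynomial_kernel(G, p=2):
    return linear_kernel(G) ** p              # element-wise power of <G_i, G_k>

def gaussian_kernel(G, sigma2=1.0):
    sq_dist = np.sum((G[:, None, :] - G[None, :, :]) ** 2, axis=2)
    return np.exp(-sq_dist / sigma2)

def min_eig(K):
    """Smallest eigenvalue of a symmetric matrix (>= 0 up to rounding if PSD)."""
    return float(np.linalg.eigvalsh(K).min())

rng = np.random.default_rng(4)
G = rng.binomial(2, 0.3, size=(30, 6)).astype(float)

K_lin, K_ibs = linear_kernel(G), ibs_kernel(G)
K_poly, K_gauss = polynomial_kernel(G, p=2), gaussian_kernel(G, sigma2=4.0)

# New kernels from existing ones: positive scaling, addition, element-wise product.
K_sum = 0.5 * K_lin + K_ibs
K_prod = K_ibs * K_gauss

smallest = {name: min_eig(K) for name, K in
            [("linear", K_lin), ("ibs", K_ibs), ("poly", K_poly),
             ("gauss", K_gauss), ("sum", K_sum), ("product", K_prod)]}
```

All six matrices stay positive semidefinite up to rounding error, which is what makes these operations safe building blocks.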

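26 Generalized Linear Model
Observations for the linear model apply to the generalized linear model.
Penalized log-likelihood function:
$$l = \sum_{i=1}^n \Big[ y_i \Big( X_i^\top \alpha + \sum_{j=1}^n \omega_j K(G_i, G_j) \Big) - \log\Big\{ 1 + \exp\Big( X_i^\top \alpha + \sum_{j=1}^n \omega_j K(G_i, G_j) \Big) \Big\} \Big] - \frac{1}{2}\lambda\,\omega^\top K \omega$$
The logistic kernel machine estimator solves
$$\begin{bmatrix} X^\top D X & X^\top D \\ D X & D + (\tau K)^{-1} \end{bmatrix} \begin{bmatrix} \hat\alpha \\ \hat{h} \end{bmatrix} = \begin{bmatrix} X^\top D \tilde{y} \\ D \tilde{y} \end{bmatrix},$$
where $\tau = \lambda^{-1}$ (the dispersion is 1 for the logistic model), $D = \mathrm{diag}\{E(y_i)(1 - E(y_i))\}$, and $\tilde{y} = X\alpha + K\omega + \mathrm{var}(y)^{-1}(y - \mu)$.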

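27 Generalized Linear Model
The same estimators can be obtained by maximizing the penalized quasi-likelihood from a logistic mixed model
$$\mathrm{logit}\,E(y_i) = X_i^\top \alpha + h_i,$$
where $h = (h_1, \ldots, h_n)^\top$ is an $n \times 1$ vector of random effects following $h \sim N(0, \tau K)$ with $\tau = 1/\lambda$.
The score statistic for $\tau$ is
$$Q = (y - \hat\mu)^\top K (y - \hat\mu),$$
where $\hat\mu_i = \mathrm{logit}^{-1}(X_i^\top \hat\alpha)$ is the fitted mean under $H_0$ (the analogue of $y - X\hat\alpha$ in the linear model). $Q$ follows a mixture of $\chi^2$ distributions.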

28 Exponential Family
Suppose $y_i$ follows a distribution in the exponential family with density
$$p(y_i; \theta_i, \phi) = \exp\Big\{ \frac{y_i\theta_i - a(\theta_i)}{\phi} + c(y_i, \phi) \Big\},$$
where $\theta_i = X_i^\top \alpha + h(G_i)$ is the canonical parameter, $a(\cdot)$ and $c(\cdot)$ are known functions, and $\phi$ is a dispersion parameter.
$\mu_i = E(y_i) = a'(\theta_i)$ and $\mathrm{var}(y_i) = \phi\, a''(\theta_i)$
Gaussian: $\phi = \sigma^2$, $a(\theta_i) = \theta_i^2/2$, and $a'(\theta_i) = \theta_i$
Logistic: $\phi = 1$, $a(\theta_i) = \log(1 + \exp(\theta_i))$, $a'(\theta_i) = \exp(\theta_i)/\{1 + \exp(\theta_i)\}$
Other distributions: log-normal, Poisson, etc.
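As a worked check of the mean and variance identities, the second derivative of $a(\cdot)$ can be computed for the two examples on the slide. The derivation below is a sketch added for illustration; it recovers the diagonal entries of $D$ used in the logistic kernel machine estimator.

```latex
% Logistic case: a(\theta) = \log(1 + e^{\theta}), \phi = 1
a'(\theta_i)  = \frac{e^{\theta_i}}{1 + e^{\theta_i}} = \mu_i, \qquad
a''(\theta_i) = \frac{e^{\theta_i}}{(1 + e^{\theta_i})^2} = \mu_i (1 - \mu_i),
\qquad\text{so}\qquad
\mathrm{var}(y_i) = \phi\, a''(\theta_i) = \mu_i (1 - \mu_i).

% Gaussian case: a(\theta) = \theta^2 / 2, \phi = \sigma^2
a'(\theta_i) = \theta_i = \mu_i, \qquad a''(\theta_i) = 1,
\qquad\text{so}\qquad
\mathrm{var}(y_i) = \phi\, a''(\theta_i) = \sigma^2 .
```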

29 Summary
Burden tests are more powerful when a large number of variants are causal and all causal variants act in the same direction (all harmful or all protective).
The variance component test is more powerful when a small number of variants are causal, or when effects of mixed direction exist.
Both scenarios can happen across the genome, and the underlying biology is unknown in advance.

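30 Combined Test
SKAT (SNP-set/Sequence Kernel Association Test): variance component test
Combine the SKAT variance component and burden test statistics (Lee et al. 2012):
$$Q_\rho = (1 - \rho)\,Q_{\mathrm{SKAT}} + \rho\,Q_{\mathrm{burden}}, \qquad 0 \le \rho \le 1$$
$\rho = 0$: SKAT
$\rho = 1$: Burden
Instead of assuming the $\{\beta_j\}$ are iid from $F(0, \tau^2)$, assume
$$\begin{pmatrix} \beta_1 \\ \vdots \\ \beta_m \end{pmatrix} \sim F\left( 0,\ \tau^2 \begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix} \right)$$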

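31 SKAT-O
$Q_\rho = (1 - \rho)\,Q_{\mathrm{SKAT}} + \rho\,Q_{\mathrm{burden}}$ is asymptotically equivalent to $(1 - \rho)\kappa + a(\rho)\eta_0$, where $\kappa$ follows a mixture of $\chi^2_1$ distributions and $\eta_0 \sim \chi^2_1$.
Use the smallest p-value over the different $\rho$'s:
$$T = \inf_{0 \le \rho \le 1} P_\rho$$
In practice, evaluate $Q_\rho$ on a set of pre-selected grid points $0 = \rho_1 < \cdots < \rho_B = 1$:
$$T = \min_{\rho \in \{\rho_1, \ldots, \rho_B\}} P_\rho$$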
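The family $Q_\rho$ can be evaluated on a grid directly from single-variant scores. The sketch below assumes unit weights, takes $Q_{\mathrm{SKAT}} = \sum_j U_j^2$ and $Q_{\mathrm{burden}} = (\sum_j U_j)^2$, and uses a made-up score vector; it also checks that $Q_\rho$ is the quadratic form $U^\top R_\rho U$ with the exchangeable correlation matrix $R_\rho$. Computing each $P_\rho$ additionally needs the null distribution of $Q_\rho$, which is not attempted here.

```python
import numpy as np

def q_rho(U, rho):
    """Q_rho = (1 - rho) * Q_SKAT + rho * Q_burden with unit weights."""
    U = np.asarray(U, dtype=float)
    q_skat = np.sum(U ** 2)
    q_burden = np.sum(U) ** 2
    return (1.0 - rho) * q_skat + rho * q_burden

def exchangeable(m, rho):
    """R_rho: ones on the diagonal, rho everywhere else."""
    return (1.0 - rho) * np.eye(m) + rho * np.ones((m, m))

U = np.array([1.2, -0.7, 0.4, 2.1])           # hypothetical single-variant scores
grid = np.linspace(0.0, 1.0, 6)               # 0 = rho_1 < ... < rho_B = 1
q_values = np.array([q_rho(U, r) for r in grid])
q_quadratic = np.array([U @ exchangeable(len(U), r) @ U for r in grid])
```

The endpoints of the grid reproduce SKAT ($\rho = 0$) and the burden test ($\rho = 1$).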

32 Summary
SKAT-O has robust power under a wide range of models.
$Q_{\mathrm{SKAT}}$ and $Q_{\mathrm{burden}}$ are not independent.
The underlying model for SKAT-O is not natural.

33 Mixed Effects Model
Model:
$$g\{E(y_i)\} = X_i^\top \alpha + \sum_{j=1}^m G_{ij}\beta_j \qquad (1)$$
  Burden: $\beta_1 = \cdots = \beta_m$
  SKAT: $\beta_j \sim F(0, \tau^2)$ independently
  SKAT-O: $\beta_j \sim F(0, \tau^2)$ with pairwise correlation $\rho$
Hierarchical model of $\beta$:
$$\beta_j = w_j^\top \eta + \delta_j \qquad (2)$$
  $w_j$: known features for the $j$th variant (e.g., $w_j = 1$ for all $j$)
  $\delta_j \sim F(0, \tau^2)$

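34 Mixed Effects Model
Plug (2) into (1):
$$g\{E(y_i)\} = X_i^\top \alpha + \Big(\sum_{j=1}^m w_j G_{ij}\Big)^\top \eta + \sum_{j=1}^m G_{ij}\delta_j$$
Some examples:
  If $w = 0$, then $\delta_j = \beta_j$ and the model becomes
  $$g\{E(y_i)\} = X_i^\top \alpha + \sum_{j=1}^m G_{ij}\beta_j, \qquad \beta_j \sim F(0, \tau^2)$$
  If $w = 1$ and $\delta_j = 0$, the model becomes
  $$g\{E(y_i)\} = X_i^\top \alpha + \Big(\sum_{j=1}^m G_{ij}\Big)\eta$$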

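35 Mixed Effects Model
Some examples:
$w_j = (w_{j1}, w_{j2})^\top$, where $w_{j1} = 1$ for $j = 1, \ldots, m$ and
$$w_{j2} = \begin{cases} 1 & \text{if the } j\text{th variant is a missense variant} \\ 0 & \text{otherwise} \end{cases}$$
$$\sum_{j=1}^m w_j G_{ij} = \Big( \sum_{j=1}^m G_{ij},\ \sum_{j=1}^m w_{j2} G_{ij} \Big)^\top$$
  $\eta_1$: average effect of the $m$ variants
  $\eta_2$: effect of missense variants relative to the average
  $\delta_j$: residual variant-specific effect, $\delta_j \sim F(0, \tau^2)$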

36 Mixed Effects Model-based Test
Mixed effects model:
$$g\{E(y_i)\} = X_i^\top \alpha + \Big(\sum_{j=1}^m w_j G_{ij}\Big)^\top \eta + \sum_{j=1}^m G_{ij}\delta_j$$
Null hypothesis: $H_0: \eta = 0$ and $\tau^2 = 0$
  $\eta$: fixed effects; $\tau^2$: variance component
The score test statistics for $\eta$ and $\tau^2$ are
$$S_\eta = (Y - X\hat\alpha)^\top (GW)(GW)^\top (Y - X\hat\alpha), \qquad S_{\tau^2} = (Y - X\hat\alpha)^\top GG^\top (Y - X\hat\alpha),$$
where $\hat\alpha$ is the MLE of $\alpha$ under $H_0$.
However, $S_{\tau^2}$ and $S_\eta$ are not independent.

37 Independence of score test statistics
We make a minor (but important) modification:
$$S_{\tau^2} = (Y - X\hat\alpha - GW\hat\eta)^\top GG^\top (Y - X\hat\alpha - GW\hat\eta),$$
where $(\hat\alpha^\top, \hat\eta^\top)$ are obtained under $\tau^2 = 0$.
We can show that $S_{\tau^2}$ and $S_\eta$ are independent:
$$E\{(GW)^\top (Y - X\hat\alpha)(Y - X\hat\alpha - GW\hat\eta)^\top G\} = \sigma^2 E\{(GW)^\top (I - P_1)(I - P_2) G\} = 0,$$
where $P_1$ is the projection onto $X$ and $P_2$ is the projection onto $(X, GW)$.
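The zero cross-term behind the independence claim can be checked numerically. In this sketch the genotypes, covariates and the missense annotation are all simulated or made up; `proj` is a hypothetical helper that builds an orthogonal projection from a pseudo-inverse.

```python
import numpy as np

def proj(A):
    """Orthogonal projection matrix onto the column space of A."""
    return A @ np.linalg.pinv(A)

rng = np.random.default_rng(5)
n, m = 60, 6
X = np.column_stack([np.ones(n), rng.normal(size=n)])
G = rng.binomial(2, 0.2, size=(n, m)).astype(float)
missense = np.array([1, 0, 1, 1, 0, 0], dtype=float)   # hypothetical annotation
W = np.column_stack([np.ones(m), missense])            # rows are w_j = (1, w_j2)
GW = G @ W                                             # n x 2 design for eta

P1 = proj(X)                                           # projection onto X
P2 = proj(np.column_stack([X, GW]))                    # projection onto (X, GW)
I = np.eye(n)

# (GW)' (I - P1)(I - P2) G should vanish: since col(X) lies inside col(X, GW),
# (I - P1)(I - P2) = I - P2, and (I - P2) annihilates GW.
cross = GW.T @ (I - P1) @ (I - P2) @ G
```

The cross-term vanishes because residualizing on $(X, GW)$ removes everything in the direction of $GW$, which is exactly why the modified $S_{\tau^2}$ uses residuals that also adjust for $GW\hat\eta$.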

38 Combining independent statistics
MiST (Mixed effects Score Test)
P-value combination:
  Fisher's combination: reject $H_0$ at significance level $\alpha$ if
  $$-2\log(P_{\tau^2}) - 2\log(P_\eta) \ge \chi^2_{4,\alpha}$$
  Tippett's combination: reject $H_0$ at significance level $\alpha$ if
  $$\min(P_{\tau^2}, P_\eta) \le 1 - (1 - \alpha)^{1/2}$$
Other combinations, e.g., the linear combination $S = \rho S_\eta + (1 - \rho) S_{\tau^2}$
Jianping Sun, Yingye Zheng, and Li Hsu (2013). A Unified Mixed-Effects Model for Rare-Variant Association in Sequencing Studies. Genetic Epidemiology, 37:
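Both p-value combinations have closed forms when the two component p-values are independent. The sketch below uses the fact that the upper tail of $\chi^2_4$ at $x$ equals $e^{-x/2}(1 + x/2)$, and that the minimum of two independent uniform p-values has distribution function $1 - (1 - t)^2$. Function names and example p-values are made up.

```python
import math

def fisher_combination(p_tau, p_eta):
    """Combined p-value for -2 log(p_tau) - 2 log(p_eta) ~ chi-square with 4 d.f."""
    x = -2.0 * math.log(p_tau) - 2.0 * math.log(p_eta)
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)     # chi-square(4) upper tail

def tippett_combination(p_tau, p_eta):
    """Combined p-value based on the minimum of two independent p-values."""
    return 1.0 - (1.0 - min(p_tau, p_eta)) ** 2

p_fisher = fisher_combination(0.01, 0.02)
p_tippett = tippett_combination(0.05, 0.5)

# The Tippett rejection rule on the slide matches "combined p-value <= alpha".
alpha = 0.05
cutoff = 1.0 - (1.0 - alpha) ** 0.5
p_at_cutoff = tippett_combination(cutoff, 0.9)
```

Fisher's rule rewards moderate evidence in both components, while Tippett's rule reacts to strong evidence in either one.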

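39 Power Comparison
$m = 10$ variants, $n = 1000$ subjects, $\alpha = 0.01$
Methods compared: Burden, SKAT, SKAT-O, MiST F, MiST T
Scenarios:
  $\beta_j = c/\{p_j(1 - p_j)\}^{1/2}$, $j = 1, \ldots,$
  $\beta_3 = 1.5c$, $\beta_4 = 1.5c$, $\beta_5 = c$, $\beta_6 = c$
  $\beta_1 = \beta_4 = \beta_7 = c$
  $\beta_1 = c$, $\beta_4 = 0.5c$, $\beta_7 = 0.25c$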

40 Dallas Heart Study
Dallas Heart Study (Victor et al. 2004): $n = 3409$ subjects; 3 genes (ANGPTL3, ANGPTL4 and ANGPTL5) were sequenced.
We analyzed these genes in association with log(triglyceride).
Table of p-values by gene (columns: ANGPTL3, ANGPTL4, ANGPTL5; rows: Burden, SKAT, SKAT-O, EREC, MiST F, MiST T, MiST F (Z), MiST T (Z)) [values missing]
The component p-values of ANGPTL5: $p_\eta = 5 \times 10^{-6}$ and $p_{\tau^2} =$ [value missing]. Furthermore, $p =$ [value missing] for nonsense variants and $p = 0.24$ for frame shift variants.

41 Summary
MiST (Mixed effects Score Test) is based on hierarchical models for a set of variants.
The model includes the usual appealing features of regression models, such as adjusting for confounders and accommodating different types of outcomes by using appropriate link functions.
It models the variant effects as a function of (known) variant characteristics to leverage information across loci while still allowing for individual variant effects.

42 Combining K studies
We have discussed for single variant analysis:
Pooling the data from $K$ studies
  Since all score statistics are derived from regression models, it is easy to account for differences between studies by adjusting for study and/or study covariates.
  Pooling the data ensures consistency in data QC and model fitting.
  Pooling can be logistically difficult and time consuming.
  Sometimes protection of human subjects prohibits sharing the data.
Meta-analysis that combines summary statistics from $K$ studies is still a viable alternative.

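43 Revisit Score Statistics
Weighted burden test:
$$U_{\mathrm{burden}} = \sum_{i=1}^n \sum_{j=1}^m w_j G_{ij}(y_i - X_i^\top \hat\alpha) = \sum_{j=1}^m w_j \underbrace{\sum_{i=1}^n G_{ij}(y_i - X_i^\top \hat\alpha)}_{U_j:\ \text{score of single variant model}}$$
$U_j = \sum_{i=1}^n G_{ij}(y_i - X_i^\top \hat\alpha)$ is the score function of the single variant model
$$y_i = X_i^\top \alpha + G_{ij}\beta_j + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)$$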

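44 Variance Component test
$Q_{\mathrm{SKAT}}$ is a weighted sum of squared score statistics of the single SNP marginal models:
$$Q_{\mathrm{SKAT}} = (Y - X\hat\alpha)^\top G_{n \times m} G_{m \times n}^\top (Y - X\hat\alpha) = \sum_{j=1}^m \Big\{ \sum_{i=1}^n G_{ij}(Y_i - X_i^\top \hat\alpha) \Big\}^2 = \sum_{j=1}^m U_j^2$$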

45 Key Elements
A vector of single variant score statistics, $U = (U_1, \ldots, U_m)^\top$, with covariance $V = \mathrm{cov}(U)$
Burden score statistic: $U_{\mathrm{burden}} = W^\top U$, $\mathrm{var}(U_{\mathrm{burden}}) = W^\top V W$
Variance component score statistic: $Q_{\mathrm{SKAT}} = U^\top U$, which follows a mixture of $\chi^2$ distributions with weights being the eigenvalues of $V$

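46 Fixed effects model
For $k = 1, \ldots, K$, let $U_k$ and $V_k$ denote the score statistics and covariance for the $k$th study.
Score statistic over $K$ studies:
$$U = \sum_{k=1}^K U_k, \qquad V = \sum_{k=1}^K V_k$$
Burden test:
$$U_{\mathrm{burden}} = W^\top U, \qquad \mathrm{var}(U_{\mathrm{burden}}) = W^\top V W$$
$$U_{\mathrm{burden}}^\top\, \mathrm{var}(U_{\mathrm{burden}})^{-1}\, U_{\mathrm{burden}} \sim \chi^2_p$$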

47 Fixed effects model
Variance component test:
$$Q_{\mathrm{SKAT}} = U^\top U \sim \sum_{j=1}^m \lambda_j \chi^2_{1,j},$$
where $\lambda_j$ is the $j$th eigenvalue of $V = \sum_{k=1}^K V_k$.
Combination of burden and variance component statistics:
$$Q_\rho = (1 - \rho)\,Q_{\mathrm{SKAT}} + \rho\,Q_{\mathrm{burden}},$$
where $\rho$ is adaptively chosen and the p-value can be obtained by one-dimensional numerical integration.
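The fixed effects meta-analysis only needs each study's score vector and covariance. The sketch below simulates three studies under the null, assumes an intercept-only null model within each study, forms $U = \sum_k U_k$ and $V = \sum_k V_k$, and then computes the burden and variance component statistics. It also checks that $U$ equals the pooled-data score when each study keeps its own intercept. All names and settings are illustrative.

```python
import numpy as np

def study_summary(y, G):
    """Score vector U_k and covariance V_k for one study (intercept-only null)."""
    resid = y - y.mean()
    Gc = G - G.mean(axis=0)                 # genotypes with the intercept projected out
    sigma2 = resid.var(ddof=1)
    return G.T @ resid, sigma2 * (Gc.T @ Gc)

rng = np.random.default_rng(6)
m = 5
studies = []
for n_k in (150, 200, 250):
    G_k = rng.binomial(2, 0.1, size=(n_k, m)).astype(float)
    y_k = rng.normal(loc=rng.normal(), size=n_k)    # study-specific mean, no genetic effect
    studies.append((y_k, G_k))

summaries = [study_summary(y_k, G_k) for y_k, G_k in studies]
U = sum(U_k for U_k, _ in summaries)                # U = sum_k U_k
V = sum(V_k for _, V_k in summaries)                # V = sum_k V_k

w = np.ones(m)                                      # unit burden weights (W is m x 1)
burden_chi2 = (w @ U) ** 2 / (w @ V @ w)            # 1 d.f. burden statistic
q_skat = U @ U                                      # Q_SKAT = U'U
lam = np.linalg.eigvalsh(V)                         # mixture weights for Q_SKAT

# Pooled analysis with a separate intercept per study gives the same score vector.
resid_pooled = np.concatenate([y_k - y_k.mean() for y_k, _ in studies])
G_pooled = np.vstack([G_k for _, G_k in studies])
U_pooled = G_pooled.T @ resid_pooled
```

The eigenvalues `lam` are what a mixture-of-$\chi^2$ routine (or a moment-matching approximation) would take as input to turn `q_skat` into a p-value.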

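48 Fixed effects model
Summaries of single variant score statistics may not be enough for the MiST score statistics:
$$S_\eta = (Y - X\hat\alpha)^\top (GW)(GW)^\top (Y - X\hat\alpha),$$
$$S_{\tau^2} = (Y - X\hat\alpha - GW\hat\eta)^\top GG^\top (Y - X\hat\alpha - GW\hat\eta),$$
where $(\hat\alpha^\top, \hat\eta^\top)$ are obtained under $\tau^2 = 0$.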

49 Random Effects Model
For $k = 1, \ldots, K$, $\beta_k = (\beta_{k1}, \ldots, \beta_{km})^\top$ is the effect of the $m$ variants in the $k$th study.
Random effects model:
$$\beta_k = \beta_0 + \xi_k,$$
where $\beta_0 = (\beta_{01}, \ldots, \beta_{0m})^\top$ represents the average effect among the studies, and $\xi_k$ is a set of random effects representing the deviation of the $k$th study from the average effect, $\xi_k \sim N(0, \Sigma)$.

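50 Heterogeneity
Assume $\Sigma = \sigma^2 B$, where $B$ is a pre-specified matrix that constrains the potentially many parameters in $\Sigma$.
One choice of $B$ is
$$B = \begin{pmatrix} b_1^2 & b_1 b_2 r & \cdots & b_1 b_m r \\ b_2 b_1 r & b_2^2 & \cdots & b_2 b_m r \\ \vdots & & \ddots & \vdots \\ b_m b_1 r & b_m b_2 r & \cdots & b_m^2 \end{pmatrix}$$
$(b_1, \ldots, b_m)$ controls the relative degrees of heterogeneity for the $m$ variants (e.g., as a function of MAF), and $r$ specifies the correlation of heterogeneity.
The choice of $B$ has no effect on the type I error but may affect the power.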
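The structured matrix $B$ can be built from $(b_1, \ldots, b_m)$ and $r$ in a few lines. This is an illustrative sketch with made-up values; the function name is hypothetical. Writing $B = r\,bb^\top + (1 - r)\,\mathrm{diag}(b^2)$ shows that it is positive semidefinite whenever $0 \le r \le 1$.

```python
import numpy as np

def heterogeneity_matrix(b, r):
    """B with b_j^2 on the diagonal and b_j * b_k * r off the diagonal."""
    b = np.asarray(b, dtype=float)
    B = r * np.outer(b, b)
    np.fill_diagonal(B, b ** 2)
    return B

b = np.array([1.0, 2.0, 0.5])        # hypothetical relative heterogeneity per variant
B = heterogeneity_matrix(b, r=0.5)
min_eigenvalue = float(np.linalg.eigvalsh(B).min())
```

The same construction applies to any matrix of this pattern with a different scale vector and correlation.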

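51 New Random Effects Burden Test
The null hypothesis: $H_0: \beta_0 = 0,\ \sigma^2 = 0$
For $k = 1, \ldots, K$, $\hat\beta_k \sim N(\beta_0, \Omega_k)$ with $\Omega_k = V_k^{-1} + \sigma^2 B$.
The log-likelihood function is
$$l = -\frac{1}{2}\sum_{k=1}^K (\hat\beta_k - \beta_0)^\top \Omega_k^{-1} (\hat\beta_k - \beta_0) - \frac{1}{2}\sum_{k=1}^K \log|\Omega_k|$$
With $\hat\beta_k \approx V_k^{-1} U_k$, the random effects (RE) test statistic is
$$U_{\mathrm{RE}} = U^\top V^{-1} U + \frac{U_\sigma^2}{V_\sigma},$$
where
$$U_\sigma = \frac{1}{2}\sum_{k=1}^K U_k^\top B U_k - \frac{1}{2}\mathrm{tr}(VB), \qquad V_\sigma = \frac{1}{2}\mathrm{tr}\Big(\sum_{k=1}^K V_k B V_k B\Big).$$
For the burden test, replace $U$ by $W^\top U$ and $V$ by $W^\top V W$.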

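52 New Random Effects Variance Component Test
$\beta_0 \sim N(0, \tau^2 W)$, where $W$ is a pre-specified matrix, e.g.,
$$W = \begin{pmatrix} w_1^2 & w_1 w_2 \rho & \cdots \\ w_2 w_1 \rho & \ddots & \\ \vdots & & w_d^2 \end{pmatrix},$$
where $(w_1, \ldots, w_d)$ controls the relative magnitude of the $d$ average genetic effects, and $\rho$ indicates the correlation.
The null hypothesis: $H_0: \tau^2 = 0,\ \sigma^2 = 0$
Let $\hat\beta = (\hat\beta_1^\top, \ldots, \hat\beta_K^\top)^\top$. Then
$$\hat\beta \sim \mathrm{MVN}\big( 0,\ \tau^2 (J_K \otimes W) + \sigma^2 (I_K \otimes B) + \mathrm{diag}(V_1^{-1}, \ldots, V_K^{-1}) \big),$$
where $\otimes$ denotes the Kronecker product.
With $\hat\beta_k \approx V_k^{-1} U_k$, the score statistic is a function of $U_k, V_k$, $k = 1, \ldots, K$.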

53 Summary
Pooled vs. meta-analysis
For meta-analysis, rare variant association tests can be constructed from multivariate summary statistics, i.e., the score vector $U$ and information matrix $V$.
Fixed vs. random effects model

54 Set-based gene-environment interaction
$m$ variants, $G_i = (G_{i1}, \ldots, G_{im})^\top$
$E_i$: environmental covariate
$X_i$: covariates
Gene-environment interaction (GxE) model:
$$g\{E(y_i)\} = X_i^\top \alpha + E_i \beta^E + \sum_{j=1}^m G_{ij}\beta_j^G + \sum_{j=1}^m (E_i G_{ij})\beta_j^{GE}$$
No interaction means $\beta^{GE} = (\beta_1^{GE}, \ldots, \beta_m^{GE})^\top = 0$.

55 Hierarchical model for $\beta^{GE}$
Model the interaction effect:
$$\beta_j^{GE} = w_j^\top \eta + \delta_j$$
  $w_j$: a vector of known features
  $\delta_j \sim F(0, \tau^2)$
The interaction effect term:
$$\sum_{j=1}^m (E_i G_{ij})\beta_j^{GE} = \Big(\sum_{j=1}^m E_i G_{ij} w_j\Big)^\top \eta + \sum_{j=1}^m E_i G_{ij}\delta_j = E_i \Big(\sum_{j=1}^m G_{ij} w_j\Big)^\top \eta + \sum_{j=1}^m (E_i G_{ij})\delta_j$$
No interaction means $H_0: \eta = 0,\ \tau^2 = 0$.

56 Challenges
The main effects $\{\beta_1^G, \ldots, \beta_m^G\}$ may not be estimated reliably if $m$ is large or the variants are rare.
Assume the main effects $\{\beta_j^G\}$ are random effects such that $\beta_j^G \sim F(0, \nu^2)$.
We need to derive score statistics for the mixed GxE effects $(\eta, \tau^2)$ in the presence of another set of random effects $\beta_j^G$.

57 Estimation
$\beta^G$ can be estimated by the maximum a posteriori approach (or best linear unbiased prediction, in the linear mixed effects model), but the computation is intensive under a generalized linear model because of an $m$-dimensional integration with no closed form.
Instead, $\hat\beta^G$ is taken to be the ridge regression estimator:
$$\hat\beta_{\mathrm{ridge}} = \operatorname*{argmin} \sum_{i=1}^n (y_i - X_i^\top \alpha - E_i \beta^E - G_i^\top \beta^G)^2 + \lambda \sum_{j=1}^m (\beta_j^G)^2,$$
where $\lambda = \sigma^2/\nu^2$.
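For the linear model, the ridge criterion has a closed-form solution in which only the genetic main effects are penalized. The sketch below is a minimal illustration on simulated data; `Z` stacks the unpenalized terms (covariates and the environmental variable), and the function name and settings are hypothetical.

```python
import numpy as np

def ridge_main_effects(y, Z, G, lam):
    """Minimize ||y - Z gamma - G beta||^2 + lam * ||beta||^2 (gamma unpenalized)."""
    D = np.column_stack([Z, G])
    penalty = np.zeros(D.shape[1])
    penalty[Z.shape[1]:] = lam                 # penalize only the columns of G
    coef = np.linalg.solve(D.T @ D + np.diag(penalty), D.T @ y)
    return coef[:Z.shape[1]], coef[Z.shape[1]:]

rng = np.random.default_rng(7)
n, m = 100, 10
E = rng.normal(size=n)                                          # environmental covariate
Z = np.column_stack([np.ones(n), rng.normal(size=n), E])        # X_i and E_i
G = rng.binomial(2, 0.05, size=(n, m)).astype(float)
y = Z @ np.array([0.2, 0.5, 0.3]) + G @ rng.normal(scale=0.2, size=m) + rng.normal(size=n)

gamma_hat, beta_hat = ridge_main_effects(y, Z, G, lam=1.0)
_, beta_shrunk = ridge_main_effects(y, Z, G, lam=1e6)           # heavy shrinkage

# First-order condition: D'(y - D c) - penalty * c = 0 at the minimizer.
D = np.column_stack([Z, G])
c = np.concatenate([gamma_hat, beta_hat])
pen = np.concatenate([np.zeros(Z.shape[1]), np.full(m, 1.0)])
gradient = D.T @ (y - D @ c) - pen * c
```

Larger $\lambda$ (smaller $\nu^2$ relative to $\sigma^2$) pulls the rare-variant main effects toward zero, which stabilizes them before the interaction score statistics are formed.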

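58 Some nice properties of ridge
Knight and Fu (2000) state that if $\lambda = o(\sqrt{n})$, then $\hat\beta_\lambda$ is a $\sqrt{n}$-consistent estimator of $\beta_0$.
Score statistic for the fixed effects under $H_0: \eta = 0,\ \tau^2 = 0$:
$$u_\eta = (D - \hat\mu)^\top \Big( E \sum_{j=1}^m G_j w_j^\top \Big)\, V\, \Big( E \sum_{j=1}^m G_j w_j^\top \Big)^\top (D - \hat\mu),$$
where $\hat\mu = \hat{E}(D \mid G, E)$ under $\eta = 0,\ \tau^2 = 0$.
Score statistic for the variance component under $H_0: \tau^2 = 0$:
$$u_{\tau^2} = (D - \tilde\mu)^\top (GE)(GE)^\top (D - \tilde\mu),$$
where $\tilde\mu = \hat{E}(D \mid G, E)$ under $\tau^2 = 0$.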

59 Combination of score statistics
P-value based: $Z_\eta = -2\log P_\eta$ and $Z_{\tau^2} = -2\log P_{\tau^2}$
$$T_f = Z_\eta + Z_{\tau^2} \sim \chi^2_4$$
Grid-search based optimal linear combination:
$$T_o = \max_{\rho \in [0,1]} \big(\rho\, u_\eta + (1 - \rho)\, u_{\tau^2}\big),$$
where $\rho$ is restricted to a set of pre-specified grid points $\{0 = \rho_0, \rho_1, \ldots, \rho_d = 1\}$.
Adaptive-weighted linear combination:
$$T_a = Z_\eta^2 + Z_{\tau^2}^2$$
  Gives more weight to either the burden or the variance component if the evidence comes mainly from one of them.
Su YR, Di C and Hsu L (2015). A unified powerful set-based test for sequencing data analysis of GxE interactions. Submitted.

60 Power comparison
$m = 25$ variants
Methods compared: $T_o$, $T_a$, $T_f$, Burden, Var Comp
Scenarios:
  $H_a$: 30% of variants have $\beta = c$
  $H_a$: half have $\beta = c$, the other half $\beta = -c$
  $H_a$: all have $\beta = c$

61 Weight
Choices of weight:
  Functional characteristics (e.g., missense, nonsense)
  Screening statistics: $M_j$ and $C_j$ are the Z statistics from marginal association screening and from screening on the correlation of $G$ and $E$
  $$w_j = \begin{cases} M_j & \text{if } M_j > C_j \\ C_j & \text{otherwise} \end{cases}$$
Since the screening statistics are independent of the GxE test, there is no need to use permutation to calculate the p-values.
Jiao S, Hsu L, et al. (2013, 2015)

62 Summary
Set-based association testing
  Mixed effects model that accounts for both the burden genetic risk score and the variance component
Meta-analysis
GxE interaction between a set of variants and an environmental factor

63 Recommended readings
Liu D et al. (2007). Semiparametric regression of multidimensional genetic pathway data: Least-Squares Kernel Machines and Linear Mixed Models. Biometrics 63:
Liu D et al. (2008). Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models. BMC Bioinformatics 9:292.
Schaid DJ (2010). Genomic similarity and kernel methods I: advancements by building on mathematical and statistical foundations. Hum Hered 70:
Schaid DJ (2010). Genomic similarity and kernel methods II: methods for genomic information. Hum Hered 70:
Zhang and Lin (2003). Hypothesis testing in semiparametric additive mixed models. Biostatistics 4:57-74.

64 Recommended readings
Lee S, Lin X and Wu M (2012). Optimal tests for rare variant effects in sequencing association studies. Biostatistics 13:
Lin DY and Tang Z (2011). A general framework for detecting disease associations with rare variants in sequencing studies. AJHG 89:
Sun J, Zheng Y and Hsu L (2013). A Unified Mixed-Effects Model for Rare-Variant Association in Sequencing Studies. Genetic Epidemiology 37:
Jiao S, ..., Hsu L (2015). Powerful Set-Based Gene-Environment Interaction Testing Framework for Complex Diseases. Genetic Epidemiology, DOI: /gepi
Tang Z and Lin DY (2015). Meta-analysis for discovering rare-variant associations: Statistical methods and software programs. AJHG 97:35-53.
Wu et al. (2011). Rare-variant association testing for sequencing data with the sequence kernel association test. AJHG 89:


More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Semiparametric Mixed Model for Evaluating Pathway-Environment Interaction

Semiparametric Mixed Model for Evaluating Pathway-Environment Interaction Semiparametric Mixed Model for Evaluating Pathway-Environment Interaction arxiv:1206.2716v1 [stat.me] 13 Jun 2012 Zaili Fang 1, Inyoung Kim 1, and Jeesun Jung 2 June 14, 2012 1 Department of Statistics,

More information

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01 Lecture 18: Introduction to covariates, the QQ plot, and population structure II + minimal GWAS steps Jason Mezey jgm45@cornell.edu April

More information

Asymptotic distribution of the largest eigenvalue with application to genetic data

Asymptotic distribution of the largest eigenvalue with application to genetic data Asymptotic distribution of the largest eigenvalue with application to genetic data Chong Wu University of Minnesota September 30, 2016 T32 Journal Club Chong Wu 1 / 25 Table of Contents 1 Background Gene-gene

More information

Chapter 4 - Fundamentals of spatial processes Lecture notes

Chapter 4 - Fundamentals of spatial processes Lecture notes Chapter 4 - Fundamentals of spatial processes Lecture notes Geir Storvik January 21, 2013 STK4150 - Intro 2 Spatial processes Typically correlation between nearby sites Mostly positive correlation Negative

More information

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the

More information

Optimization. The value x is called a maximizer of f and is written argmax X f. g(λx + (1 λ)y) < λg(x) + (1 λ)g(y) 0 < λ < 1; x, y X.

Optimization. The value x is called a maximizer of f and is written argmax X f. g(λx + (1 λ)y) < λg(x) + (1 λ)g(y) 0 < λ < 1; x, y X. Optimization Background: Problem: given a function f(x) defined on X, find x such that f(x ) f(x) for all x X. The value x is called a maximizer of f and is written argmax X f. In general, argmax X f may

More information

Bayesian Regression (1/31/13)

Bayesian Regression (1/31/13) STA613/CBB540: Statistical methods in computational biology Bayesian Regression (1/31/13) Lecturer: Barbara Engelhardt Scribe: Amanda Lea 1 Bayesian Paradigm Bayesian methods ask: given that I have observed

More information

Lecture 5: LDA and Logistic Regression

Lecture 5: LDA and Logistic Regression Lecture 5: and Logistic Regression Hao Helen Zhang Hao Helen Zhang Lecture 5: and Logistic Regression 1 / 39 Outline Linear Classification Methods Two Popular Linear Models for Classification Linear Discriminant

More information

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017 Lecture 2: Genetic Association Testing with Quantitative Traits Instructors: Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 29 Introduction to Quantitative Trait Mapping

More information

Now consider the case where E(Y) = µ = Xβ and V (Y) = σ 2 G, where G is diagonal, but unknown.

Now consider the case where E(Y) = µ = Xβ and V (Y) = σ 2 G, where G is diagonal, but unknown. Weighting We have seen that if E(Y) = Xβ and V (Y) = σ 2 G, where G is known, the model can be rewritten as a linear model. This is known as generalized least squares or, if G is diagonal, with trace(g)

More information

Generalized Linear Models. Last time: Background & motivation for moving beyond linear

Generalized Linear Models. Last time: Background & motivation for moving beyond linear Generalized Linear Models Last time: Background & motivation for moving beyond linear regression - non-normal/non-linear cases, binary, categorical data Today s class: 1. Examples of count and ordered

More information

Module 17: Bayesian Statistics for Genetics Lecture 4: Linear regression

Module 17: Bayesian Statistics for Genetics Lecture 4: Linear regression 1/37 The linear regression model Module 17: Bayesian Statistics for Genetics Lecture 4: Linear regression Ken Rice Department of Biostatistics University of Washington 2/37 The linear regression model

More information

Multivariate Regression

Multivariate Regression Multivariate Regression The so-called supervised learning problem is the following: we want to approximate the random variable Y with an appropriate function of the random variables X 1,..., X p with the

More information

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 Lecture 2: Linear Models Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector

More information

1 Mixed effect models and longitudinal data analysis

1 Mixed effect models and longitudinal data analysis 1 Mixed effect models and longitudinal data analysis Mixed effects models provide a flexible approach to any situation where data have a grouping structure which introduces some kind of correlation between

More information

Optimization Problems

Optimization Problems Optimization Problems The goal in an optimization problem is to find the point at which the minimum (or maximum) of a real, scalar function f occurs and, usually, to find the value of the function at that

More information

,..., θ(2),..., θ(n)

,..., θ(2),..., θ(n) Likelihoods for Multivariate Binary Data Log-Linear Model We have 2 n 1 distinct probabilities, but we wish to consider formulations that allow more parsimonious descriptions as a function of covariates.

More information

High-Throughput Sequencing Course

High-Throughput Sequencing Course High-Throughput Sequencing Course DESeq Model for RNA-Seq Biostatistics and Bioinformatics Summer 2017 Outline Review: Standard linear regression model (e.g., to model gene expression as function of an

More information

Quasi-likelihood Scan Statistics for Detection of

Quasi-likelihood Scan Statistics for Detection of for Quasi-likelihood for Division of Biostatistics and Bioinformatics, National Health Research Institutes & Department of Mathematics, National Chung Cheng University 17 December 2011 1 / 25 Outline for

More information

Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2

Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2 Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, 2010 Jeffreys priors Lecturer: Michael I. Jordan Scribe: Timothy Hunter 1 Priors for the multivariate Gaussian Consider a multivariate

More information

Latent Variable Methods for the Analysis of Genomic Data

Latent Variable Methods for the Analysis of Genomic Data John D. Storey Center for Statistics and Machine Learning & Lewis-Sigler Institute for Integrative Genomics Latent Variable Methods for the Analysis of Genomic Data http://genomine.org/talks/ Data m variables

More information

Semi-Nonparametric Inferences for Massive Data

Semi-Nonparametric Inferences for Massive Data Semi-Nonparametric Inferences for Massive Data Guang Cheng 1 Department of Statistics Purdue University Statistics Seminar at NCSU October, 2015 1 Acknowledge NSF, Simons Foundation and ONR. A Joint Work

More information

Research Article Sample Size Calculation for Controlling False Discovery Proportion

Research Article Sample Size Calculation for Controlling False Discovery Proportion Probability and Statistics Volume 2012, Article ID 817948, 13 pages doi:10.1155/2012/817948 Research Article Sample Size Calculation for Controlling False Discovery Proportion Shulian Shang, 1 Qianhe Zhou,

More information

1 Preliminary Variance component test in GLM Mediation Analysis... 3

1 Preliminary Variance component test in GLM Mediation Analysis... 3 Honglang Wang Depart. of Stat. & Prob. wangho16@msu.edu Omics Data Integration Statistical Genetics/Genomics Journal Club Summary and discussion of Joint Analysis of SNP and Gene Expression Data in Genetic

More information

Methods for Cryptic Structure. Methods for Cryptic Structure

Methods for Cryptic Structure. Methods for Cryptic Structure Case-Control Association Testing Review Consider testing for association between a disease and a genetic marker Idea is to look for an association by comparing allele/genotype frequencies between the cases

More information

On the limiting distribution of the likelihood ratio test in nucleotide mapping of complex disease

On the limiting distribution of the likelihood ratio test in nucleotide mapping of complex disease On the limiting distribution of the likelihood ratio test in nucleotide mapping of complex disease Yuehua Cui 1 and Dong-Yun Kim 2 1 Department of Statistics and Probability, Michigan State University,

More information

HERITABILITY ESTIMATION USING A REGULARIZED REGRESSION APPROACH (HERRA)

HERITABILITY ESTIMATION USING A REGULARIZED REGRESSION APPROACH (HERRA) BIRS 016 1 HERITABILITY ESTIMATION USING A REGULARIZED REGRESSION APPROACH (HERRA) Malka Gorfine, Tel Aviv University, Israel Joint work with Li Hsu, FHCRC, Seattle, USA BIRS 016 The concept of heritability

More information

Some New Aspects of Dose-Response Models with Applications to Multistage Models Having Parameters on the Boundary

Some New Aspects of Dose-Response Models with Applications to Multistage Models Having Parameters on the Boundary Some New Aspects of Dose-Response Models with Applications to Multistage Models Having Parameters on the Boundary Bimal Sinha Department of Mathematics & Statistics University of Maryland, Baltimore County,

More information

Hierarchical Generalized Linear Models. ERSH 8990 REMS Seminar on HLM Last Lecture!

Hierarchical Generalized Linear Models. ERSH 8990 REMS Seminar on HLM Last Lecture! Hierarchical Generalized Linear Models ERSH 8990 REMS Seminar on HLM Last Lecture! Hierarchical Generalized Linear Models Introduction to generalized models Models for binary outcomes Interpreting parameter

More information

STA216: Generalized Linear Models. Lecture 1. Review and Introduction

STA216: Generalized Linear Models. Lecture 1. Review and Introduction STA216: Generalized Linear Models Lecture 1. Review and Introduction Let y 1,..., y n denote n independent observations on a response Treat y i as a realization of a random variable Y i In the general

More information

Modeling IBD for Pairs of Relatives. Biostatistics 666 Lecture 17

Modeling IBD for Pairs of Relatives. Biostatistics 666 Lecture 17 Modeling IBD for Pairs of Relatives Biostatistics 666 Lecture 7 Previously Linkage Analysis of Relative Pairs IBS Methods Compare observed and expected sharing IBD Methods Account for frequency of shared

More information

Expression Data Exploration: Association, Patterns, Factors & Regression Modelling

Expression Data Exploration: Association, Patterns, Factors & Regression Modelling Expression Data Exploration: Association, Patterns, Factors & Regression Modelling Exploring gene expression data Scale factors, median chip correlation on gene subsets for crude data quality investigation

More information

Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations

Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations Yale School of Public Health Joint work with Ning Hao, Yue S. Niu presented @Tsinghua University Outline 1 The Problem

More information

Generalized Elastic Net Regression

Generalized Elastic Net Regression Abstract Generalized Elastic Net Regression Geoffroy MOURET Jean-Jules BRAULT Vahid PARTOVINIA This work presents a variation of the elastic net penalization method. We propose applying a combined l 1

More information

Bayesian linear regression

Bayesian linear regression Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding

More information

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Canonical Edps/Soc 584 and Psych 594 Applied Multivariate Statistics Carolyn J. Anderson Department of Educational Psychology I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Canonical Slide

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION doi:10.1038/nature25973 Power Simulations We performed extensive power simulations to demonstrate that the analyses carried out in our study are well powered. Our simulations indicate very high power for

More information

Chapter 4 - Fundamentals of spatial processes Lecture notes

Chapter 4 - Fundamentals of spatial processes Lecture notes TK4150 - Intro 1 Chapter 4 - Fundamentals of spatial processes Lecture notes Odd Kolbjørnsen and Geir Storvik January 30, 2017 STK4150 - Intro 2 Spatial processes Typically correlation between nearby sites

More information

Generalized Linear Models Introduction

Generalized Linear Models Introduction Generalized Linear Models Introduction Statistics 135 Autumn 2005 Copyright c 2005 by Mark E. Irwin Generalized Linear Models For many problems, standard linear regression approaches don t work. Sometimes,

More information

Sample Size and Power Considerations for Longitudinal Studies

Sample Size and Power Considerations for Longitudinal Studies Sample Size and Power Considerations for Longitudinal Studies Outline Quantities required to determine the sample size in longitudinal studies Review of type I error, type II error, and power For continuous

More information

Latent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent

Latent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent Latent Variable Models for Binary Data Suppose that for a given vector of explanatory variables x, the latent variable, U, has a continuous cumulative distribution function F (u; x) and that the binary

More information

Linear Methods for Prediction

Linear Methods for Prediction Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we

More information

Lecture 3: Basic Statistical Tools. Bruce Walsh lecture notes Tucson Winter Institute 7-9 Jan 2013

Lecture 3: Basic Statistical Tools. Bruce Walsh lecture notes Tucson Winter Institute 7-9 Jan 2013 Lecture 3: Basic Statistical Tools Bruce Walsh lecture notes Tucson Winter Institute 7-9 Jan 2013 1 Basic probability Events are possible outcomes from some random process e.g., a genotype is AA, a phenotype

More information

Least Squares Estimation-Finite-Sample Properties

Least Squares Estimation-Finite-Sample Properties Least Squares Estimation-Finite-Sample Properties Ping Yu School of Economics and Finance The University of Hong Kong Ping Yu (HKU) Finite-Sample 1 / 29 Terminology and Assumptions 1 Terminology and Assumptions

More information

DNA polymorphisms such as SNP and familial effects (additive genetic, common environment) to

DNA polymorphisms such as SNP and familial effects (additive genetic, common environment) to 1 1 1 1 1 1 1 1 0 SUPPLEMENTARY MATERIALS, B. BIVARIATE PEDIGREE-BASED ASSOCIATION ANALYSIS Introduction We propose here a statistical method of bivariate genetic analysis, designed to evaluate contribution

More information

Exam: high-dimensional data analysis January 20, 2014

Exam: high-dimensional data analysis January 20, 2014 Exam: high-dimensional data analysis January 20, 204 Instructions: - Write clearly. Scribbles will not be deciphered. - Answer each main question not the subquestions on a separate piece of paper. - Finish

More information

Semiparametric Generalized Linear Models

Semiparametric Generalized Linear Models Semiparametric Generalized Linear Models North American Stata Users Group Meeting Chicago, Illinois Paul Rathouz Department of Health Studies University of Chicago prathouz@uchicago.edu Liping Gao MS Student

More information

Covariance function estimation in Gaussian process regression

Covariance function estimation in Gaussian process regression Covariance function estimation in Gaussian process regression François Bachoc Department of Statistics and Operations Research, University of Vienna WU Research Seminar - May 2015 François Bachoc Gaussian

More information

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University Econ 690 Purdue University In virtually all of the previous lectures, our models have made use of normality assumptions. From a computational point of view, the reason for this assumption is clear: combined

More information

NIH Public Access Author Manuscript Stat Sin. Author manuscript; available in PMC 2013 August 15.

NIH Public Access Author Manuscript Stat Sin. Author manuscript; available in PMC 2013 August 15. NIH Public Access Author Manuscript Published in final edited form as: Stat Sin. 2012 ; 22: 1041 1074. ON MODEL SELECTION STRATEGIES TO IDENTIFY GENES UNDERLYING BINARY TRAITS USING GENOME-WIDE ASSOCIATION

More information

STAT331. Cox s Proportional Hazards Model

STAT331. Cox s Proportional Hazards Model STAT331 Cox s Proportional Hazards Model In this unit we introduce Cox s proportional hazards (Cox s PH) model, give a heuristic development of the partial likelihood function, and discuss adaptations

More information

Hierarchical Modeling for Univariate Spatial Data

Hierarchical Modeling for Univariate Spatial Data Hierarchical Modeling for Univariate Spatial Data Geography 890, Hierarchical Bayesian Models for Environmental Spatial Data Analysis February 15, 2011 1 Spatial Domain 2 Geography 890 Spatial Domain This

More information

Sample size calculations for logistic and Poisson regression models

Sample size calculations for logistic and Poisson regression models Biometrika (2), 88, 4, pp. 93 99 2 Biometrika Trust Printed in Great Britain Sample size calculations for logistic and Poisson regression models BY GWOWEN SHIEH Department of Management Science, National

More information

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics arxiv:1603.08163v1 [stat.ml] 7 Mar 016 Farouk S. Nathoo, Keelin Greenlaw,

More information

Parametric Modelling of Over-dispersed Count Data. Part III / MMath (Applied Statistics) 1

Parametric Modelling of Over-dispersed Count Data. Part III / MMath (Applied Statistics) 1 Parametric Modelling of Over-dispersed Count Data Part III / MMath (Applied Statistics) 1 Introduction Poisson regression is the de facto approach for handling count data What happens then when Poisson

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Linear Regression Models P8111

Linear Regression Models P8111 Linear Regression Models P8111 Lecture 25 Jeff Goldsmith April 26, 2016 1 of 37 Today s Lecture Logistic regression / GLMs Model framework Interpretation Estimation 2 of 37 Linear regression Course started

More information

A New Bayesian Variable Selection Method: The Bayesian Lasso with Pseudo Variables

A New Bayesian Variable Selection Method: The Bayesian Lasso with Pseudo Variables A New Bayesian Variable Selection Method: The Bayesian Lasso with Pseudo Variables Qi Tang (Joint work with Kam-Wah Tsui and Sijian Wang) Department of Statistics University of Wisconsin-Madison Feb. 8,

More information

Lecture 1: Case-Control Association Testing. Summer Institute in Statistical Genetics 2015

Lecture 1: Case-Control Association Testing. Summer Institute in Statistical Genetics 2015 Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2015 1 / 1 Introduction Association mapping is now routinely being used to identify loci that are involved with complex traits.

More information

Lattice Data. Tonglin Zhang. Spatial Statistics for Point and Lattice Data (Part III)

Lattice Data. Tonglin Zhang. Spatial Statistics for Point and Lattice Data (Part III) Title: Spatial Statistics for Point Processes and Lattice Data (Part III) Lattice Data Tonglin Zhang Outline Description Research Problems Global Clustering and Local Clusters Permutation Test Spatial

More information

Part 8: GLMs and Hierarchical LMs and GLMs

Part 8: GLMs and Hierarchical LMs and GLMs Part 8: GLMs and Hierarchical LMs and GLMs 1 Example: Song sparrow reproductive success Arcese et al., (1992) provide data on a sample from a population of 52 female song sparrows studied over the course

More information

Binomial Mixture Model-based Association Tests under Genetic Heterogeneity

Binomial Mixture Model-based Association Tests under Genetic Heterogeneity Binomial Mixture Model-based Association Tests under Genetic Heterogeneity Hui Zhou, Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 April 30,

More information

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01 Lecture16: Population structure and logistic regression I Jason Mezey jgm45@cornell.edu April 11, 2017 (T) 8:40-9:55 Announcements I April

More information

STA 216: GENERALIZED LINEAR MODELS. Lecture 1. Review and Introduction. Much of statistics is based on the assumption that random

STA 216: GENERALIZED LINEAR MODELS. Lecture 1. Review and Introduction. Much of statistics is based on the assumption that random STA 216: GENERALIZED LINEAR MODELS Lecture 1. Review and Introduction Much of statistics is based on the assumption that random variables are continuous & normally distributed. Normal linear regression

More information

Heritability estimation in modern genetics and connections to some new results for quadratic forms in statistics

Heritability estimation in modern genetics and connections to some new results for quadratic forms in statistics Heritability estimation in modern genetics and connections to some new results for quadratic forms in statistics Lee H. Dicker Rutgers University and Amazon, NYC Based on joint work with Ruijun Ma (Rutgers),

More information

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D.

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Ruppert A. EMPIRICAL ESTIMATE OF THE KERNEL MIXTURE Here we

More information

Multiple QTL mapping

Multiple QTL mapping Multiple QTL mapping Karl W Broman Department of Biostatistics Johns Hopkins University www.biostat.jhsph.edu/~kbroman [ Teaching Miscellaneous lectures] 1 Why? Reduce residual variation = increased power

More information

Hierarchical Modelling for Univariate Spatial Data

Hierarchical Modelling for Univariate Spatial Data Hierarchical Modelling for Univariate Spatial Data Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department

More information

Physics 403. Segev BenZvi. Parameter Estimation, Correlations, and Error Bars. Department of Physics and Astronomy University of Rochester

Physics 403. Segev BenZvi. Parameter Estimation, Correlations, and Error Bars. Department of Physics and Astronomy University of Rochester Physics 403 Parameter Estimation, Correlations, and Error Bars Segev BenZvi Department of Physics and Astronomy University of Rochester Table of Contents 1 Review of Last Class Best Estimates and Reliability

More information

Lecture 9: Linear Regression

Lecture 9: Linear Regression Lecture 9: Linear Regression Goals Develop basic concepts of linear regression from a probabilistic framework Estimating parameters and hypothesis testing with linear models Linear regression in R Regression

More information

The International Journal of Biostatistics

The International Journal of Biostatistics The International Journal of Biostatistics Volume 2, Issue 1 2006 Article 2 Statistical Inference for Variable Importance Mark J. van der Laan, Division of Biostatistics, School of Public Health, University

More information

THE STATISTICAL ANALYSIS OF GENETIC SEQUENCING AND RARE VARIANT ASSOCIATION STUDIES. Eugene Urrutia

THE STATISTICAL ANALYSIS OF GENETIC SEQUENCING AND RARE VARIANT ASSOCIATION STUDIES. Eugene Urrutia THE STATISTICAL ANALYSIS OF GENETIC SEQUENCING AND RARE VARIANT ASSOCIATION STUDIES Eugene Urrutia A dissertation submitted to the faculty at the University of North Carolina at Chapel Hill in partial

More information