Optimal tests for rare variant effects in sequencing association studies

Dec 28, 2017 15:45 · 947 words · 5 minutes read hypothesis testing variance component test

\[ \newcommand\E{\text{E}} \newcommand\nocr{\nonumber\\} \newcommand\var{\text{Var}} \newcommand\diag{\text{diag}} \]

Meta data of reading

  • Journal: Biostatistics
  • Year: 2012
  • DOI: 10.1093/biostatistics/kxs014

Motivation

Rare variants have very low MAF so GWAS doesn’t work under current sample size. To increase power, the strategy is to aggregate variants according to biological knowledge (or assumptions). There are two main types of tests: burden tests and non-burden tests. Burden test is region-based and it suffers from the assumption that all rare variants have the same direction of the effect and the power is low if there are too many non-causal variants. The non-burden test, sequence kernel association test (SKAT) does not make such assumption but it is less powerful than burden test if the assumption is almost true. The motivation of this paper is to propose a general version of hypothesis test which includes burden/non-burden tests as special cases. It provides a way to maximize the power no matter how the true model looks like.

The approach

They proposed to construct a class of test that is an arbitrary linear combination of burden test and SKAT statistics. The optimal test could be identified by maximizing the power within the class.

The goal is to test whether the rare variants in a single region are associated with a complex trait. Suppose sample size is \(n\) and the number of SNVs is \(p\). For \(i\)th subject, \(y_i\) is phenotype and \(G_i = (g_{i1}, ..., g_{ip})\) is genotype where \(g_{ij} = 0, 1, 2\). \(X_i = (x_{i1}, ..., x_{iq})\) is the covariates that are needed to be adjusted for. For continuous/categorical phenotypes, they used generalized linear model where \(\E(y_{i}) = \mu_i\) and \(\var(y_i) = \phi \nu(\mu_i)\) with a link function

\[\begin{align} g(\mu_i) &= X_i \alpha + G_i \beta \label{eq:glm} \end{align}\]

\(\nu(\cdot)\) is a variance function and \(\alpha, \beta\) are the vectors of regression coefficients for covariates and rare variants. With \(\eqref{eq:glm}\), the test for association can be constructed as \(H_0: \beta = (\beta_1, ..., \beta_p)' = \vec{0}\). But such test suffers from large degree of freedom (d.f.) (\(p\)). To reduce d.f., additional assumptions were made.

Burden-based tests

The popular burden-based tests reduce the d.f. by making the assumptions that each \(\beta_j\) is a function of the MAFs such that \(\beta_j = w(m_j) = w_j\beta_0\), where \(m_j\) is the MAF of the \(j\)th variant. With this assumption, \(\eqref{eq:glm}\) becomes:

\[\begin{align} g(\mu_i) &= X_i \alpha + \beta_0 \sum_{j = 1}^p w_j g_{ij} \label{eq:burden} \end{align}\]

Then the test \(H_0: \beta_0 = 0\) has 1 d.f. The test was referred as weighted counting burden test (WBT).

SKAT

SKAT takes a different approach to reduce d.f. It assumes that each \(\beta_j\) independently follows an arbitrary distribution with mean zero and variance \(w_j^2 \psi\), where \(w_j\) is a fixed number that may depend on MAF. With this assumption, the null hypothesis \(H_0: \beta = \vec{0}\) is equivalent to \(H_0: \psi\) under variance component test in GLMM. Suppose

  1. \(X\) is \(n \times q\)
  2. \(G\) is \(n \times p\)
  3. \(W = \diag(w_1, ..., w_p)\)

\(K = GWWG'\) is \(n \times n\) is an weighted linear kernel matrix. SKAT paper proposed to use a class of flexible weight functions of MAF, \(w_j = \text{Beta}(m_j, a_1, a_2)\) where \(a_1, a_2\) were pre-specified and \(m_j\) were estimated using the sample MAF of the \(j\)th variant.

Define the working vector by \(y^* = X\alpha + \Delta (y - \mu)\), where \(\Delta = \diag\{g'(\mu_i))\), and the variance matrix by \(V = \diag(\phi \nu(\mu_i)[g'(\mu_i)]\}\). The estimates under the null were: \(\tilde{y} = X\hat\alpha + \hat\Delta (y - \hat\mu)\), \(\hat\Delta = \diag\{g'(\hat\mu_i))\), and \(\hat{V} = \diag(\phi \nu(\hat\mu_i)[g'(\hat\mu_i)]\}\). Where all estimates were obtained under the null hypothesis. It turned out that the score test statistic of the variance component \(\psi\) is

\[\begin{align} Q &= (\tilde{y} - X\alpha)' \hat{V}^{-1} K \hat{V}^{-1} (\tilde{y} - X\hat\alpha) \nocr &= (y - \hat\mu)' \hat\Delta \hat{V}^{-1} K \hat{V}^{-1} \hat\Delta (y - \hat\mu) \nonumber \end{align}\]

This result follows from the derivation of the variance component test of GLMM in this paper equation 8.

This paper proposed

The weighted linear kernel was constructed under the assumption that \(\beta_j\)s were independent. But it is not true if the region is of high risk to have deleterious mutations. This paper introduced a new family of kernels. They proposed to allow \(\beta\) to follow a multivariate distribution with exchangable correlation structure. Let the correlation matrix of \(\beta\) be \(R_\rho = (1 - \rho)I + \rho \vec{1} \vec{1}'\). Then the SKAT test statistic becomes

\[\begin{align} Q_{\rho} = (y - \hat\mu)' \hat\Delta \hat{V}^{-1} K_\rho \hat{V}^{-1} \hat\Delta (y - \hat\mu) \nonumber \end{align}\]

This family of statistic contains both WBT and SKAT:

  1. \(\rho = 0\), it becomes SKAT
  2. \(\rho = 1\), it becomes WBT

Then, \(Q_\rho = (1 - \rho) Q_{SKAT} + \rho Q_{burden}\). For a fixed \(\rho\), \(Q_\rho\) follows a mixture of \(\chi^2\) distributions. Namely if \(\lambda_1, ..., \lambda_m\) are eigenvalues of \(\hat{V}^{-1/2}K_\rho\hat{V}^{-1/2}\), the null distribution of \(Q_\rho\) can be closely approximated by \(\sum \lambda_j \chi_{1, j}^2\), where \(\chi_{1, j}^2\)s are iid \(\chi_1^2\) random variables. The p-values can be obtained by matching moments or inverting characteristic function.

Optimal test

In practice, \(\rho\) is unknown. This paper proposed the test statistic \(T = \inf_{0 \le \rho \le 1} p_\rho\) to obtain tthe optimal performance. It turns out that \(Q_\rho\) can be approximated by the mixture of two independent random variables (\((1 - \rho)\kappa + \tau(\rho)\eta_0, \kappa = \sum_k \lambda_k \eta_k + \xi\)) which can be further approximated. Let \(q_{\min}(\rho)\) denote the \((1 - T)\)th percentile of the distribution of \(Q_\rho\). Then the p-value of \(T\) is

\[\begin{align} 1 - \Pr(Q_{\rho_1} &< q_{\min}(\rho_1), ..., Q_{\rho_b} < q_{\min}(\rho_b)) \nocr &= 1 - \E[\Pr(\kappa < \min\{(q_\min(\rho_nv) - \tau(\rho_\nu)\eta_0)/(1 - \rho_\nu)\})|\eta_0] \end{align}\]

, which can be computed efficiently.