Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits
Dec 16, 2017 11:06 · 1664 words · 8 minutes read
Metadata of reading
- Journal: Nature Genetics
- Year: 2012
- DOI: 10.1038/ng.2213
$$ \newcommand\independent{\perp\!\!\!\!\perp} \newcommand\E{\text{E}} \newcommand\nocr{\nonumber\cr} \newcommand\cov{\text{Cov}} \newcommand\var{\text{Var}} \newcommand\cor{\text{Cor}} \newcommand\numberthis{\addtocounter{equation}{1}\tag{\theequation}} $$
Motivation
The standard GWAS procedure uses a single-SNP model to test association, selects the SNP with the strongest signal to represent a genomic region (~2 Mb), and computes the genetic variation explained based on this SNP only. The paper pointed out the underlying assumptions of this procedure:
Implicit assumptions, often untested, are that the detected association at the top SNP captures the maximum amount of variation in the region by its LD with an unknown causal variant and that other SNPs in the vicinity show association because they are correlated with the top SNP.
They pointed out two reasons why this assumption may fail:
- Even if there is only one causal variant, a single tagging SNP may not capture all of its variation
- There may be more than one causal variant in the region
So, the one-SNP-per-locus procedure may underestimate the underlying causal genetic variation. Some studies have performed conditional analysis to find secondary SNPs within a locus. The paper proposed a systematic approach to conditional and joint analysis that combines GWAS meta-analysis summary statistics with LD correlations estimated from a reference sample of the same population.
Multi-SNP model and joint effect
The multi-SNP model is:

$$\begin{align} y &= X \vec{b} + \vec{e} \nonumber \end{align}$$

, where $X \in \mathbb{R}^{n \times N}, \vec{b} \in \mathbb{R}^N, \vec{e} \in \mathbb{R}^n$, and the $e_i$ are iid with $\var(e_i) = \sigma^2_J$. The least squares solution and its (co)variance are:

$$\begin{align} \hat{b} &= (X'X)^{-1} X'y \label{eq:sej} \cr \var(\hat{b}) &= \sigma^2_J (X'X)^{-1} \label{eq:varj} \end{align}$$
Note that \eqref{eq:varj} is an N-by-N (co)variance matrix. It is derived as follows:
$$\begin{align} \var(\hat{b}_j) &= \var\left( [(X'X)^{-1}]_j X' y \right) = \sum_{i=1}^n \left( [(X'X)^{-1}]_j X' \right)_i^2 \var(e_i) \label{eq:inter1} \end{align}$$

Note that in \eqref{eq:inter1}, we have $[(X'X)^{-1}_j X'] \in \mathbb{R}^{1 \times n}$ and $\var(e) \in \mathbb{R}^{n \times 1}$, which are vectors. But notice that $e_1, \dots, e_n$ are iid with variance $\sigma^2_J$. The expression can be simplified as:

$$\begin{align} \var(\hat{b}_j) &= \sigma^2_J \, t_j (X'X) t_j' = \sigma^2_J [(X'X)^{-1}]_{jj} \nonumber \end{align}$$

, where $t_j := [(X'X)^{-1}]_j$. Similarly,

$$\begin{align} \cov(\hat{b}_j, \hat{b}_k) &= \sigma^2_J \, t_j (X'X) t_k' = \sigma^2_J [(X'X)^{-1}]_{jk} \nonumber \end{align}$$

So,

$$\begin{align} \var(\hat{b}) &= \sigma^2_J (X'X)^{-1} \nonumber \end{align}$$

, which is the result of \eqref{eq:varj}.
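To make this concrete, here is a minimal numpy sketch (simulated data; the variable names are mine, not the paper's) that computes $\hat{b}$ and $\widehat{\var}(\hat{b}) = \hat\sigma^2_J (X'X)^{-1}$ from individual-level data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 1000, 5                              # n individuals, N SNPs

# simulate genotypes and a phenotype (both centered, so no intercept is needed)
maf = rng.uniform(0.1, 0.5, size=N)
X = rng.binomial(2, maf, size=(n, N)).astype(float)
X -= X.mean(axis=0)
b_true = np.array([0.3, 0.0, 0.1, 0.0, -0.2])
y = X @ b_true + rng.normal(size=n)
y -= y.mean()

# joint least-squares fit: b_hat = (X'X)^{-1} X'y
XtX = X.T @ X
b_hat = np.linalg.solve(XtX, X.T @ y)

# residual variance sigma_J^2 and the N-by-N covariance of b_hat
resid = y - X @ b_hat
sigma2_J = resid @ resid / (n - N)
cov_b_hat = sigma2_J * np.linalg.inv(XtX)
print(b_hat, np.sqrt(np.diag(cov_b_hat)))   # joint effects and their standard errors
```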
Single-SNP model and marginal effect
In practice, only single-SNP model results are available. The single-SNP model fits each SNP $i$ separately:

$$\begin{align} y &= X_i \beta_i + \vec{e}_{M(i)} \nonumber \end{align}$$

So, stacking the per-SNP fits, the LS estimator and its variance are:

$$\begin{align} \hat\beta &= D^{-1} X'y \label{eq:se} \cr \var(\hat\beta) &= \sigma^2_M D^{-1} \label{eq:varm} \end{align}$$

, where $D = \text{diag}(\vec{d}), d_i = X_i' X_i$, and $X_i$ is the $i$th column of $X$. The derivation of \eqref{eq:varm} is as follows:

$$\begin{align} \var(\hat\beta_i) &= \var\left( d_i^{-1} X_i' y \right) = d_i^{-2} X_i' \var(y) X_i = \sigma^2_M d_i^{-1} \nonumber \end{align}$$
Here, we treat the single-SNP model as the truth (so $\var(y) = \sigma^2_M I$ in the step above). Note that the SNPs are treated as independent of each other (namely $\cov(\hat\beta_i, \hat\beta_j) = 0$). Different SNPs do not have to share the same residual variance, so a more precise expression is $\var(\hat\beta_i) = \sigma^2_{M(i)} (X_i' X_i)^{-1}$.
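One relation worth making explicit here (my own note; it is what the next section inverts): taking the expectation of the single-SNP estimator under the multi-SNP model shows how the marginal effects mix the joint effects through LD,

$$\begin{align} \E(\hat\beta) &= D^{-1} X' \E(y) = D^{-1} (X'X)\, \vec{b} \nonumber \end{align}$$

so recovering $\vec{b}$ from $\hat\beta$ amounts to multiplying by $(X'X)^{-1} D$, which is exactly what the summary-statistic formula below does with $X'X$ replaced by its approximation $B$.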
Inferring joint effect from single effect
From \eqref{eq:se}, we have $X'y = D \hat\beta$. Under the multiple-SNP model, the proportion of variance explained by all SNPs is:

$$\begin{align} R^2_J &= \frac{\hat{b}' X'y}{y'y} = \frac{\hat{b}' D \hat\beta}{y'y} \nonumber \end{align}$$

Then, we can derive:

$$\begin{align} \hat\sigma^2_J &= \frac{(1 - R^2_J)\, y'y}{n - N} = \frac{y'y - \hat{b}' D \hat\beta}{n - N} \nonumber \end{align}$$

Similarly, under the single-SNP model of the $j$th SNP:

$$\begin{align} \hat\sigma^2_{M(j)} &= \frac{(1 - R^2_{M(j)})\, y'y}{n - 1} = \frac{y'y - D_j \hat\beta_j^2}{n - 1} \nonumber \end{align}$$

From \eqref{eq:varm}, we have $\var(\hat\beta_j) = \hat\sigma^2_{M(j)} / D_j$, so we get:

$$\begin{align} y'y &= D_j \var(\hat\beta_j) (n - 1) + D_j \hat\beta_j^2 \label{eq:yy} \end{align}$$
This expression provides a way to obtain $y'y$ from $\hat\beta_j$ and $\hat{\var}(\hat\beta_j)$ (without knowing the individual-level data $y$). In practice, the paper used the median of the inferred $y'y$ obtained from $j = 1, \dots, N$.
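A small sketch of that step in code, assuming per-SNP summary statistics `beta_hat` and `se` (so $\var(\hat\beta_j) = se_j^2$), allele frequencies `p`, a nominal sample size `n`, the approximation $D_j \approx 2p_j(1-p_j)n$ introduced below, and \eqref{eq:yy} in the form reconstructed above (all array names are mine):

```python
import numpy as np

def infer_yty(beta_hat, se, p, n):
    """Infer y'y per SNP via eq. (yy) and summarize with the median across SNPs."""
    D = 2.0 * p * (1.0 - p) * n                        # D_j = X_j'X_j, approximated
    yty_per_snp = D * se**2 * (n - 1) + D * beta_hat**2
    return np.median(yty_per_snp)
```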
- Side note
The reason the above analysis is performed is to obtain joint-model statistics from the single-SNP model without querying the individual-level data $y$. One useful relation is $\hat\sigma^2 = \frac{(1 - R^2)\, y'y}{n - N}$, where $n - N$ is the residual degrees of freedom. $R^2$ is computable since it is just the covariance between the fitted values and the response as a proportion of the overall variance of the response. For the single-SNP case, $\hat\sigma^2$ is available via \eqref{eq:varm}.
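To see where that relation comes from, a one-line derivation under the centered, no-intercept setup used throughout (my own filling-in):

$$\begin{align} \hat\sigma^2 &= \frac{\text{RSS}}{n - N} = \frac{y'y - \hat{b}'X'y}{n - N} = \left(1 - \frac{\hat{b}'X'y}{y'y}\right) \frac{y'y}{n - N} = \frac{(1 - R^2)\, y'y}{n - N} \nonumber \end{align}$$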
In meta-analysis, $X$ is not available either. But since $X'X$ is essentially the (co)variance matrix of the SNP genotypes, it can be approximated using the LD correlations estimated from a matched reference sample. The paper used $W$ to denote the genotype matrix of such a reference sample, where $w_{ij} = -2f_j, 1 - 2 f_j, 2 - 2 f_j$ denotes the three genotypes (two major alleles, heterozygous, two minor alleles, respectively), with $f_j$ the frequency of the minor allele. Under this setup, $\E(w_j) = 0, \var(w_j) = 2f_j(1 - f_j)$. Therefore, we have:

$$\begin{align} X'X \approx B, \quad B_{jk} &= \sqrt{\frac{D_j D_k}{D_{W(j)} D_{W(k)}}} (W'W)_{jk} \label{eq:app} \end{align}$$

where $D, D_W$ are the diagonal matrices with the diagonal entries of $X'X, W'W$. $D_{j}$ is not available without $X$, but it can be approximated by $2p_j(1 - p_j)n$, with $p_j$ the reported allele frequency of the $j$th SNP. From \eqref{eq:sej}, \eqref{eq:se}, \eqref{eq:app}, we have (writing $\tilde{b}$ for the joint effect estimated from summary statistics):

$$\begin{align} \tilde{b} &= B^{-1} D \hat\beta \nonumber \end{align}$$

Similar to \eqref{eq:varj},

$$\begin{align} \var(\tilde{b}) &= \hat\sigma^2_J B^{-1} \nonumber \end{align}$$
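A minimal sketch of this approximation in code, assuming a reference genotype matrix `W` (individuals by SNPs, same SNPs and allele coding as the GWAS), reported allele frequencies `p`, sample size `n`, and marginal estimates `beta_hat`; all names are mine:

```python
import numpy as np

def joint_from_summary(W, p, n, beta_hat):
    """Approximate X'X by B from a reference panel, then return b_tilde = B^{-1} D beta_hat."""
    W = W - W.mean(axis=0)                    # center the reference genotypes
    WtW = W.T @ W
    D_W = np.diag(WtW)                        # D_W(j) = W_j'W_j
    D = 2.0 * p * (1.0 - p) * n               # D_j approximated as 2 p_j (1 - p_j) n
    B = np.sqrt(np.outer(D, D) / np.outer(D_W, D_W)) * WtW
    return np.linalg.solve(B, D * beta_hat)
```

Since distant SNP pairs are assigned zero correlation (as noted below), $B$ is effectively sparse in the genome-wide analysis.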
In the paper, distant SNP pairs were assigned zero correlation instead of the correlation observed in $W$. The paper also pointed out an additional complexity in practice: SNPs may have different effective sample sizes due to imputation failures. Therefore, the paper suggested estimating the effective sample size for each SNP and using the adjusted sample size to compute $B_{jk}$ and $D_j$. The procedure is as follows (a small code sketch follows the list):
- From \eqref{eq:yy}, we obtain $y'y$ by taking the median
- Obtain $\hat{n}_j$ using \eqref{eq:yy}:
$$\begin{align} \hat{n}_j &= \frac{y'y}{D_j \var(\hat\beta_j)} - \frac{\hat\beta_j^2}{\var(\hat\beta_j)} + 1 \nonumber \end{align}$$
- Compute $B_{jk}$ and $D_j$ using the adjusted sample size:
$$\begin{align} B_{jk} &= 2 \min(\hat{n}_j, \hat{n}_k) \sqrt{\frac{p_j(1 - p_j)p_k(1 - p_k)}{D_{W(j)}D_{W(k)}}} (W'W)_{jk} \nocr D_j &= 2p_j(1 - p_j) \hat{n}_j \nonumber \end{align}$$
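A sketch of this adjustment, transcribing the formulas above (the median `yty` comes from \eqref{eq:yy}, and `D` is the unadjusted $D_j$; array names are mine):

```python
import numpy as np

def effective_sample_size(yty, beta_hat, se, D, p):
    """Per-SNP effective sample size n_hat_j and the correspondingly adjusted D_j."""
    var_beta = se**2
    n_hat = yty / (D * var_beta) - beta_hat**2 / var_beta + 1.0
    D_adj = 2.0 * p * (1.0 - p) * n_hat       # D_j = 2 p_j (1 - p_j) n_hat_j
    return n_hat, D_adj
```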
Obtain p-value of marginal effect under multi-SNP model
In brief, the above derivation provides a way to infer the joint-effect distribution from single-SNP summary statistics. Namely, we get:

$$\begin{align} \tilde{b} &= B^{-1} D \hat\beta, \quad \var(\tilde{b}) = \hat\sigma^2_J B^{-1} \nonumber \end{align}$$

The marginal distribution of each SNP's effect size, $\tilde{b}_i$, is:

$$\begin{align} \tilde{b}_i &\sim \mathcal{N}\left( b_i, [\hat\sigma^2_J B^{-1}]_{ii} \right) \nonumber \end{align}$$

Therefore, we can construct a test as follows:
- $H_0$: $b_i$ is zero
- $H_1$: $b_i$ is not zero
Under the null, $\tilde{b}_i \sim \mathcal{N}(0, \var(\tilde{b}_i))$. The two-sided p-value is then $2\left(1 - \Phi\left( |\tilde{b}_i| / \sqrt{\var(\tilde{b}_i)} \right)\right)$, which is the p-value for the marginal effect of the $i$th SNP under the multi-SNP model.
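Putting the test into code (a short sketch; `b_tilde` and its covariance matrix come from the formulas above):

```python
import numpy as np
from scipy import stats

def joint_pvalues(b_tilde, cov_b_tilde):
    """Two-sided p-values for each SNP's effect under the multi-SNP model."""
    se = np.sqrt(np.diag(cov_b_tilde))
    z = b_tilde / se
    return 2.0 * stats.norm.sf(np.abs(z))     # 2 * (1 - Phi(|z|))
```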
Conditional analysis
The logic of this part is not very intuitive to me, but after some struggling, I ended up with the following.
First of all, the conditional analysis takes a two-step procedure to estimate $\hat{b}_2 | \hat{b}_1$. That is:
- Do $y \sim X_1$ and obtain $\bar{b}_1 = (X_1'X_1)^{-1} X_1'y$
- Compute $\tilde{y} = y - X_1 \bar{b}_1$
- Do $\tilde{y} \sim X_2$ and obtain $\hat{b}_2 | \hat{b}_1 = (X_2' X_2)^{-1} X_2' \tilde{y}$, which matches equation (15) in the paper
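Rewriting that last step in terms of summary-level quantities, using $X_2'y = D_2 \hat\beta_2$ from \eqref{eq:se} and $X'X \approx B$ from \eqref{eq:app} (my own rearrangement; the paper's equation may differ in detail):

$$\begin{align} \hat{b}_2 | \hat{b}_1 &= (X_2'X_2)^{-1} \left( X_2'y - X_2'X_1 \bar{b}_1 \right) \approx B_{22}^{-1} \left( D_2 \hat\beta_2 - B_{21} \bar{b}_1 \right) \nonumber \end{align}$$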
The variance of $\hat{b}_2 | \hat{b}_1$ had me stuck for a while because it is unclear how to define $\hat\sigma^2_C$. It is still not entirely clear to me, but what I get is the following. Write the model underlying the conditional analysis as:

$$\begin{align} y &= X_1 \vec{b}_1 + X_2 \vec{b}_2 + \vec{e}, \quad \var(e_i) = \sigma^2_C \label{eq:y} \end{align}$$

Treating $\tilde{y}$ as the response of the second-step regression then gives:

$$\begin{align} \var(\hat{b}_2 | \hat{b}_1) &= \sigma^2_C (X_2'X_2)^{-1} \nonumber \end{align}$$

So, it seems to me that $\sigma^2_C$ and $\sigma^2_J$ are interchangeable if $X = [X_1, X_2]$. If that condition is not satisfied, \eqref{eq:y} is not the multi-SNP model, so it is better to denote the residual variance as $\sigma^2_C$. The paper computed $\hat\sigma^2_C$ with an analogous plug-in formula based on $y'y$ and the fitted effects; similar to the previous derivation, the individual-level statistics in it can be replaced by $D, \hat\beta, B$. Note that $M_{12}, M_{21}, M_1, M_2$ (the blocks of $X'X$ corresponding to $X_1$ and $X_2$) can be approximated by the corresponding blocks of $B$.
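Putting my reading of the conditional step into code (`B21`, `B22` are the corresponding blocks of $B$, and `sigma2_C` is supplied externally since its exact formula is not reproduced here; all names are mine):

```python
import numpy as np
from scipy import stats

def conditional_test(bbar1, beta2, D2, B21, B22, sigma2_C):
    """Conditional estimates and p-values for the tested SNPs given the fitted SNPs in X1."""
    b2_given_1 = np.linalg.solve(B22, D2 * beta2 - B21 @ bbar1)
    se = np.sqrt(np.diag(sigma2_C * np.linalg.inv(B22)))
    pvals = 2.0 * stats.norm.sf(np.abs(b2_given_1 / se))
    return b2_given_1, se, pvals
```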
Results in brief
The paper performed the analysis using the GIANT GWAS and two reference genotype datasets. If two SNPs were in low LD, the results were similar to the single-SNP model. For positively correlated SNPs (modest LD), the single-SNP model tended to overestimate the effect sizes; while the multi-SNP model gave smaller effect sizes, they still reached genome-wide significance. For negatively correlated SNPs, the single-SNP model tended to miss one of the signals because it was masked by the other, and the multi-SNP model was more powerful in this case. They found 36 loci with multiple signals, with 38 leading SNPs and 49 additional SNPs. The result is robust to the choice of reference sample. Conditional analysis was also performed to identify secondary associations within loci. The analysis was also applied to case-control studies, where the reported effect is an odds ratio rather than an effect on a quantitative trait.