Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores

Jan 14, 2018 11:47 · 1253 words · 6 minutes read predict disease risk ld pruning

Meta data of reading

  • Journal: AJHG
  • Year: 2015
  • DOI: 10.1016/j.ajhg.2015.09.001

\[ \newcommand\E{\text{E}} \newcommand\var{\text{Var}} \newcommand\diag{\text{diag}} \]

In brief

This paper proposed a method to estimate disease risk using GWAS summary statistics. The previous method is to do LD pruning and p-value thresholding but it potentially drop out informative SNPs and leads to lack of explained heritability. This paper proposed a Bayesian polygenic risk scores (PRS), LDpred, that estimates the posterior mean causal effect sizes from GWAS results by assigning prior on genetic architecture of the disease and the LD information from a reference panel.

The model

Assume that phenotype \(Y\), genotype \(X\) are both standardized. \[\begin{align*} Y &= \sum_{i = 1}^M X_i \beta_i + \epsilon \\ \hat\beta_i &= X_i' Y / N \end{align*}\]

, where \(\hat\beta_i\) is marginal effect estimate.

If \(z_i\) is provided instead, \(\hat\beta_i = s_i(z_i / \sqrt{N})\), where \(s_i\) is the sign of \(z_i\).

Unadjusted PRS

\[\begin{align*} S_i &= \sum_{j = 1}^M X_{ji} \hat\beta_j \end{align*}\]

, where \(S_i\) is the predicted diseases risk of \(i\)th individual. Note that here all SNPs are used to predict the outcome and marginal effects are used which implicitly assume that every loci is not correlated with each other.

P+T

In practice, LD pruning and p-value thresholding improve the accuracy. For instance, LD pruning with \(r^2 > 0.2\) and p-value thresholding with various values over a grid to optimize the performance (validation data should be available).

Bpred

In the sense of minimizing prediction error variance, the posterior mean prediction is optimal linear prediction (see Bayes estimator on wikipedia). Namely, \[\begin{align} \E(Y | \tilde\beta, \hat{D}) &= \sum_{i = 1}^M X_i' \E(\beta_i | \tilde\beta, \hat{D}) \label{eq:posterior-mean-estimate} \end{align}\]

, where \(\tilde\beta\) is the marginal least-squares estimate and \(\hat{D}\) is observed LD in the training data.

Under standardized \(X\), \(\var(Y) = h_g^2 \Theta + (1 - h_g^2) I\) because \[\begin{align*} \var(Y) &:= \var_{\beta, \epsilon}(Y) \\ &= \var_{\beta, \epsilon}(X\beta + \epsilon) \\ &= \var_{\beta}(X\beta) + \var_{\epsilon}(\epsilon) \quad \text{, since $\beta$, $\epsilon$ are independent} \\ &= h_g^2 (XX' / M) + (1 - h_g^2) I \quad\text{, here it seems that one assumption is $\beta$ iid}\\ &= h_g^2 \Theta + (1 - h_g^2) I \end{align*}\] If all samples are independent and marks are unlinked with \(\beta_i \sim_{iid} \mathcal{N}(0, h_g^2 / M)\) (Gaussian infinitesimal prior), then \[\begin{align*} \E(\beta_i | \tilde{\beta}) &= \E(\beta_i | \tilde\beta_i) = \frac{h_g^2}{h_g^2 + M / N} \tilde\beta_i \\ \text{cor}(\hat{Y}_\text{PRS}, Y) &= \frac{h_g^2}{h_g^2 + M / N} h_g^2 \end{align*}\] If a Gaussian mixture is assumed instead, namely \[\begin{align*} \beta_i &\sim \begin{cases} \mathcal{N}(0, h_g^2 / (h_g^2 + Mp/N)) & \text{prob } p \\ 0 & \text{prob } (1 - p)\end{cases} \end{align*}\] , then \[\begin{align*} \E(\beta_i | \tilde\beta) &= \frac{h_g^2}{h_g^2 + Mp / N} \bar{p}_i \tilde\beta_i \end{align*}\]

LDpred

If allowing loci to be linked but distant loci are not linked, under Gaussian infinitesimal prior \[\begin{align*} \E(\beta^l | \tilde\beta^l, D) &\approx (\frac{M}{Nh_g^2}I + D_l)^{-1} \tilde\beta^l \end{align*}\]

For Gaussian mixture prior, the posterior mean is obtained by MCMC Gibbs sampler.

Some derivations

This section sketches some derivations of the results listed above. The term ‘unlinked’ means that the single-locus model captures the proper effect which is the same as the multi-loci model. The term ‘infinitesimal’ means that the effect \(\beta_i\) with measure zero at zero.

Unlinked markers and infinitesimal effect

Suppose \(\beta_i \sim \mathcal{N}(0, h^2 / M)\) (then variance of \(\beta\) is \(h^2\)). \[\begin{align*} \tilde\beta_i | \beta_i &\sim \mathcal{N}(\beta_i, \frac{1}{N}(1 - h^2 / M)) \\ &\quad\text{, since 'unlinked', one loci explains $h^2/M$ variance}\\ \beta_i | \tilde\beta_i &\sim \mathcal{N}(\frac{1}{1 + M / h^2N}\tilde\beta_i, \frac{1}{N}\frac{1}{1 + M / h^2N}) \\ &\quad\text{, derived from prior and conditional with large $M$} \end{align*}\]

Unlinked markers and non-infinitesimal effect

Suppose \[\begin{align} \beta_i &\sim \begin{cases} \delta_0(\beta_i) & \text{, with prob $1 - p$} \\ \mathcal{N}(0, h^2 / Mp)(\beta_i) & \text{, with prob $p$} \end{cases} \label{eq:auxiliary} \end{align}\]

, where \(\var(\beta_i)\) is kept to be \(h^2 / M\)

Let’s first introduce auxiliary variable \(Z_i \sim \text{Ber}(p)\) and \[\begin{align*} f(\beta_i| Z_i) &= (1 - Z_i) \delta_0(\beta_i) + Z_i\mathcal{N}(0, h^2/Mp)(\beta_i) \end{align*}\] , then \(f(\beta_i)\) is kept to be . The independences of variables is Similarly, \[\begin{align*} \tilde\beta_i | \beta_i &\sim \mathcal{N}(\beta_i, \frac{1}{N}(1 - h^2 / M)) \\ &\quad\text{, since $\var(\beta_i)$ is kept the same as before} \\ f(\beta_i | \tilde\beta_i, Z_i = 0) &\propto f(\beta_i | Z_i = 0) \times f(\tilde\beta_i | \beta_i) \\ &\propto \delta_0(\beta_i) \times \mathcal{N}(\beta_i, \sigma_1^2)(\tilde\beta_i) \\ &\propto \delta_0(\beta_i) \\ f(\beta_i | \tilde\beta_i, Z_i = 1) &\propto \mathcal{N}(0, \sigma_2^2)(\beta_i) \times \mathcal{N}(\beta_i, \sigma_1^2)(\tilde\beta_i) \\ &\propto \mathcal{N}(k \tilde\beta_i, k / N) \end{align*}\]

, where \(\sigma_1^2 = \frac{1}{N}(1 - h^2 / M), \sigma_2^2 = \var(\beta_i), k = \sigma_2^2 / (\sigma_1^2 + \sigma_2^2)\).

The posterior mean is \[\begin{align*} \E(\beta_i | \tilde\beta_i) &= \E_{Z_i | \tilde\beta_i}[\E(\beta_i | \tilde\beta_i, Z_i)] \\ &= \E_{Z_i | \tilde\beta_i}[Z_i \times k\tilde\beta_i + (1 - Z_i) \times 0] \\ &= k\bar{p}_i\tilde\beta_i \quad\text{, where $\bar{p}_i = \E(Z_i | \tilde\beta_i)$} \end{align*}\]

Linked markers and infinitesimal effect

Suppose the \(i\)th region contains \(N_i\) SNPs and the SNPs are linked with each other within the region but no correlation outside the region. Then, the marginal effect \(\tilde\beta_i \in \mathbb{R}^{M_i}\) conditional on \(\beta_i\) is \[\begin{align*} \E(\tilde\beta_i | \beta_i) &= D_i \beta_i \\ \var(\tilde\beta_i | \beta_i) &= \frac{1}{N} (1 - h_i^2) D_i \\ \text{, where} & \\ D_i &= X_i'X_i / N \\ \tilde\beta_i &= X_i'Y / N \\ h_i^2 &\text{is} \text{ phenotype variance explained by the $i$th region} \end{align*}\]

, which can be deduced by replacing \(Y\) with \(X_i\beta_i + \epsilon_i\).

Then, \[\begin{align*} \tilde\beta_i | \beta_i &\sim \mathcal{N}(D_i\beta_i, \Sigma_1) \\ \beta_i &\sim \mathcal{N}(0, \Sigma_2) \\ \text{, where} & \\ \Sigma_1 &= \frac{1}{N} (1 - h_i^2) D_i \\ \Sigma_2 &= (h_i^2 / M)I \end{align*}\] , so that we can get the posterior distribution \[\begin{align*} \beta_i | \tilde\beta_i &\sim \mathcal{N}(A\tilde\beta_i, \Sigma) \\ \text{where, }& \\ \Sigma &= (\Sigma_2^{-1} + D_i\Sigma_1^{-1}D_i)^{-1} = \frac{1}{N}(\frac{1}{1 - h_i^2}D_i + \frac{M}{Nh_i^2}I)^{-1}\\ A &= \Sigma D_i \Sigma_1^{-1} \\ \text{when } & h^2 \approx 0 \\ A &\approx (\frac{1}{1 - h_i^2}D_i + \frac{M}{Nh_i^2}I)^{-1} \end{align*}\]

Linked markers and non-infinitesimal effect

The goal is to obtain posterior mean and the paper proposed to use a Gibbs sampler by sampling \(\beta_i | \tilde\beta, \beta_{-i}\). Specifically, the sampling can be decomposed as \(Z_i | \tilde\beta, \beta_{-i}\) and \(\beta_i | \tilde\beta, \beta_{-i}, Z_i\). These two variables can be approximately sampled as follow \[\begin{align*} \Pr(Z_i | \tilde\beta, \beta_{-i}) &\approx \Pr(Z_i | \tilde\beta_i, \beta_{-i}) \\ f(\beta_i | \tilde\beta, \beta_{-i}, Z_i = 1) &\approx f(\beta_i | \tilde\beta_i, \beta_{-i}, Z_i = 1) \\ \end{align*}\]

Comment: At first, I did not get the idea of this approximation. After reading their code on Gibbs sampler, I kind of get the point.

Consider two-loci case in term of the graphical model

Since \(\tilde\beta_i\) is collidar node, \(f(\beta_i|\tilde\beta, \beta_{-i}) \ne f(\beta_i | \tilde\beta_i, \beta_{-i})\). To obtain \(f(\beta_i | \tilde\beta, \beta_{-i})\) \[\begin{align*} f(\beta_i | \tilde\beta, \beta_{-i}, Z_i = 1) &= \int f(\tilde\beta | \beta) \prod_{j \ne i} f(\beta_j) \mathcal{N}(0, \sigma^2)(\beta_i) d\beta_i \\ &= \prod_{j \ne i} f(\beta_j) \frac{1}{\sqrt{2\pi|\Sigma|}}\int \exp\{-\frac{1}{2}(\tilde\beta - D\beta)' \Sigma^{-1} (\tilde\beta - D\beta)\} \exp\{-\frac{1}{2}\beta_i^2\sigma^{-2}\} d\beta_i \\ &= \prod_{j \ne i} f(\beta_j) \frac{1}{\sqrt{2\pi|\Sigma|}} \times \sqrt{2\pi(\omega + \hat\beta'\Omega\hat\beta)} \int \exp\{\hat\beta'X\hat\beta)\} \quad\text{, where $X = \Omega - \Omega(\Omega + \Sigma_2)^{-1}\Omega$} \end{align*}\]

, where \(\hat\beta = \tilde\beta - \beta|_{\beta_i = 0}\) and \(\Omega = \Sigma^{-1}, \omega = \sigma^{-2}, \Sigma_2 = \diag(1, ..., \omega, 1, ..., 1)\) at the \(i\)th entry. To compute \(|\Sigma|\) is computationally intensive. To avoid it, we need to work with univariate normal instead. Therefore, the paper proposed to use \(f(\beta_i|\tilde\beta_i, \beta_{-i})\) to approximate it since \(\tilde\beta_i | \beta\) is simply \(\mathcal{N}(\beta_i, (1 - h^2) / N)\) (directly get from \(\tilde\beta|\beta\)).