1 About

In this post, we run BrainXcan with an example.

First of all, please make sure you’ve cloned Brainxcan code repository and installed all the BrainXcan software dependencies.

To perform BrainXcan analysis, we need to prepare the following files:

The BrainXcan database and meta files (download from brainxcan database zenodo link)
- To run S-BrainXcan with permutation within LD blocks, you will also need an LD block BED file, e.g. link and we have it copied here as well
A file containing GWAS summary statistics. It should have the following information:
- SNP rsID (we use rsID as SNP identifier) and chromosome of the SNP.
- Effect and non-effect allele
- Summary statistics (one in the two groups):
  1. Group 1: effect size, standard error of effect size (zero-centered)
  2. Group 2: z-score, GWAS sample size, allele frequency
A pipeline configuration file.

2 Downloading BrainXcan database

The BrainXcan database and meta files can be downloaded from brainxcan database zenodo link. The directory structure is

brainxcan_data/
|-- bxcan_vis
|-- geno_covar
|-- idp_gwas
|-- idp_weights
`-- mr

There is no need to change the structure and it is designed to be used asis.

See some detailed notes on the BrainXcan database at “Looking into BrainXcan database”.

3 Downloading GWAS

In this example, we provide a pre-formatted GWAS file at here: preview link and direct download link, which is adapted from a schizophrenia GWAS by Ripke et al. (2020) (medRxiv link).

First, let’s download and take a look at it.

# download
wget https://uchicago.box.com/shared/static/s2yg1gvxm5vv1q3yijf5h389o3hkadwv.gz \
  -O formatted.SCZ_PGC_2020.tsv.gz
  
# look at the first few rows
zcat < formatted.SCZ_PGC_2020.tsv.gz | head

# returns
variant_id  chromosome  non_effect_allele   effect_allele   effect_size standard_error  zscore  sample_size frequency
rs3131972   chr1    G   A   -0.0020020026706730793  0.0111  -0.18439957 71533   0.1652
rs3131969   chr1    G   A   -0.006601743634562364   0.0118  -0.56070304 71533   0.1339
rs3131967   chr1    C   T   -0.005796768847177342   0.0119  -0.48651794 71533   0.1171
rs1048488   chr1    T   C   0.002696361547742533    0.0117  0.2303755   71533   0.1667
rs12562034  chr1    G   A   -0.003902375817241093   0.0156  -0.24843223 66899   0.1027
rs12124819  chr1    A   G   -0.011800104115750618   0.0181  -0.64967835 58941   0.2243
rs4040617   chr1    A   G   -0.0010993954433010642  0.0127  -0.09036144 71843   0.1295
rs4970383   chr1    C   A   0.02810116511090993 0.0111  2.5433347   64548   0.1964
rs4475691   chr1    C   T   0.012102946067745635    0.0121  0.9998151   65154   0.1429

4 Preparing the BrainXcan configuration file

TLDR: The configuration file is in YAML format and the one for example run can be downloaded from here. And there are a few fields that you need to fill in.

If you are not in a hurry, we recommend going through the following sections which explain the configuration file in details.

The configuration file contains three main parts:

Input/output file paths.
Meta information about the GWAS file.
Some parameter settings in BrainXcan.

We will go over them one by one.

4.1 Specifying input/output paths

datadir: Path to the BrainXcan database folder.
outdir: Output directory name (if it does not exist, it will be generated).
prefix: Prefix of the output files (to distinguish between results in the same output folder).
brainxcan_path: Path to the BrainXcan code repository.

In the configuration YAML, we have:

datadir: 'path-to-braixcan-database/brainxcan_data'
outdir: 'brainxcan_test'
prefix: 'SCZ'
brainxcan_path: 'path-to-brainxcan-repository/brainxcan'

4.2 Specifying GWAS meta information

Here we want to tell the software how to understand the GWAS file.

gwas: Path to GWAS file (formats: csv(.gz), tsv(.gz), parquet).
snpid: Column name of GWAS SNP rsID
effect_allele: Column name of GWAS effect allele
non_effect_allele: Column name of GWAS non-effect allele
chr: Column name of chromosome
position: Column name of position

Along with Group 1 or 2 (see details at here):

Group 1:
- effect_size: Column name of GWAS effect size
- effect_size_se: Column name of GWAS effect size standard error
Group 2:
- zscore: Column name of GWAS z-score
- sample_size: Column name of GWAS sample size
- allele_frequency: Column name of allele frequency

When both groups are specified, group 1 has high priority. For the example run, we go with group 1.

In the configuration YAML, we have:

gwas: 'formatted.SCZ_PGC_2020.tsv.gz'
snpid: 'variant_id'
effect_allele: 'effect_allele'
non_effect_allele: 'non_effect_allele'
chr: 'chromosome'
effect_size: 'effect_size'
effect_size_se: 'standard_error'
position: 'position'

4.3 Specifying parameters in BrainXcan

There are several parameters underlying a BrainXcan run. They are associated with three aspects:

BrainXcan association test
- model_type: IDP prediction model type: ridge or elastic_net (Default = ridge)
- idp_type: IDP sets to use: original or residual (after PC adjustment) (Default = residual)
- spearman_cutoff: CV Spearman cutoff on models (only models passing this criteria will be shown) (Default = 0.1)
BrainXcan adjustment options
- (Recommended) Adjusting z-scores with variance control (The method details are described here)
  - It requires GWAS effective sample size, GWAS trait heritability, and pre-computed phi values for IDP models (IDP phi values can be downloaded from here)
  - pheno_h2: GWAS trait heritability (SNP heritability - e.g. it can be estimated from LDSC regression)
  - gwas_n: GWAS effective sample size
  - bxcan_vc_z: Set to True when want run this adjustment
  - bxcan_vc_phi: Path to the phi value file
- (Not recommended) Adjusting z-scores with LD block-based permutation
  - bxcan_ldblock_perm: If want to run this adjustment, set the path to the BED file (TAB-delimited base0 with header chr, start, stop)
  - bxcan_ldblock_perm_seed: Random seed
  - bxcan_ldblock_perm_nrepeat: Number of permutation across all IDPs
  - correction_factor_perm: Correcting the standard error of the simulated z-scores by multiplying with this factor (default = 1.1)
- (Not recommended) Adjusting z-scores via simulated weights
  - bxcan_empirical_z: Whether to run this adjustment or not
  - bxcan_empirical_z_seed: Random seed
  - bxcan_empirical_z_nrepeat: Number of simulated IDPs
  - correction_factor_emp: Correcting the standard error of the simulated z-scores by multiplying with this factor
Mendelian randomization (MR): extrating a list of significant IDP and run MR
- fdr_cutoff: FDR cutoff to call an IDP significant (Default = 0.1). This will overwrite signif_pval below.
- signif_pval: P-value cutoff to call an IDP significant (Default = 1e-5)
- signif_max_idps: Maximun number of IDPs to run MR (discard by p-value if too many IDPs passing p-value filter) (Default = 10)
- ld_clump_yaml: Parameters of LD clumping for defining the instrument SNPs in MR (Default = {datadir}/mr/ld_clump.yaml)$\dagger$
Dependent executables
- rscript_exe: Path to the Rscript executable (Default = Rscript)
- python_exe: Path to the Python executable (Default = python)
- plink_exe: Path to the plink v1.9 executable (Default = plink)

$\dagger$ {datadir} will be interpreted as path-to-braixcan-database/brainxcan_data internally since we set datadir: path-to-braixcan-database/brainxcan_data.

In the configuration YAML, we have:

model_type: 'ridge'
idp_type: 'residual'
spearman_cutoff: 0.1 

fdr_cutoff: 0.1
signif_pval: 0.01
signif_max_idps: 5
gwas_pop: 'EUR'
ld_clump_yaml: '{datadir}/mr/ld_clump.yaml'

rscript_exe: 'Rscript' 
python_exe: 'python'
plink_exe: 'plink'

The full list of parameters available in a BrainXcan configuration YAML is explained (with default values) at here

5 Running BrainXcan

Now we are ready to run BrainXcan pipeline. The general command is

# conda activate brainxcan 

REPO=path-to-brainxcan-repository
CONFIG=path-to-config-file

snakemake \
  -s $REPO/Snakefile \
  --configfile $CONFIG \
  [rule-name]

[rule-name] can take the following four options:

rule.name	Association	Visualization	Mendelian.randomization	Automated.report
SBrainXcanOnly	x
SBrainXcanAndVis	x	x
SBrainXcanAndMR	x		x
SBrainXcan	x	x	x	x

Let’s say we want to run through everything, i.e. replacing [rule-name] with SBrainXcan. So, we can do:

snakemake \
  -s $REPO/Snakefile \
  --configfile $CONFIG \
  -n -p \
  SBrainXcan

We use -n options in snakemake to perform a dry-run to see what will be run (-p is to print more information about each step). The MR runs won’t be printed since we’ve not known which IDPs will be significant but these will be updated automatically during the computation.

If everything looks good, we could start running the command (-j1 tells snakemake to use one core).

snakemake \
  -s $REPO/Snakefile \
  --configfile $CONFIG \
  -j1 -p \
  SBrainXcan

It takes about 9 minutes with 4 Gb memory. The output files are put in folder brainxcan_test/ with prefix SCZ.

6 Looking at the output of BrainXcan

The output files of the above run can be downloaded from here. Here is an overview of the files.

brainxcan_test
├── SCZ.report.Rmd
├── SCZ.report.html
├── SCZ.sbrainxcan.csv
├── SCZ.vis.T1.pdf
├── SCZ.vis.dMRI.pdf
└── SCZ_files
    ├── MR_results
    │   ├── dmri_IDP-25687.MR.rds
    │   ├── dmri_IDP-25687.MR_sumstats.tsv
    │   ├── dmri_IDP-25687.MR_vis.pdf
    │   ├── dmri_PC-FA-TBSS-1.MR.rds
    │   ├── dmri_PC-FA-TBSS-1.MR_sumstats.tsv
    │   ├── dmri_PC-FA-TBSS-1.MR_vis.pdf
    │   ├── dmri_PC-ICVF-ProbTrack-1.MR.rds
    │   ├── dmri_PC-ICVF-ProbTrack-1.MR_sumstats.tsv
    │   ├── dmri_PC-ICVF-ProbTrack-1.MR_vis.pdf
    │   ├── dmri_PC-ICVF-TBSS-1.MR.rds
    │   ├── dmri_PC-ICVF-TBSS-1.MR_sumstats.tsv
    │   ├── dmri_PC-ICVF-TBSS-1.MR_vis.pdf
    │   ├── t1_IDP-25020.MR.rds
    │   ├── t1_IDP-25020.MR_sumstats.tsv
    │   └── t1_IDP-25020.MR_vis.pdf
    └── logs
        ├── dmri_IDP-25687.mr.log
        ├── dmri_IDP-25687.mr_vis.log
        └── [more MR log files]

[prefix].sbrainxcan.csv: BrainXcan association results in CSV format.
[prefix].vis.T1.pdf: Visualization of associations of structural IDPs.
[prefix].vis.dMRI.pdf: Visualization of associations of diffusion IDPs (only TBSS IDPs).
[prefix]_files/MR_results: Mendelian randomization for each significant IDPs
- *.MR_vis.pdf: Plotting effect sizes of instrument SNPs in MR.
- *.MR_sumstats.tsv: Results of MR tests summarized in a table (TSV format).
- *.MR.rds: Raw data used for MR test.
[prefix].report.html: The automated HTML report summarizing the association and MR results (for diffusion IDPs, only TBSS ones are shown).
[prefix]_files/logs/: Log files of each analysis module.

6.1 Association output table

Let’s take a quick look at brainxcan_test/SCZ.sbrainxcan.csv (first two rows). The descriptions of each columns are at the bottom of the table.

Error in dplyr::bind_rows(): ! Can’t combine ..1$bhat and ..2$bhat . Backtrace: 1. … %>% pander(split.cell = 2, emphasize.strong.rows = 3) 3. dplyr::bind_rows(…)

6.2 Mendelian randomization output table

Let’s take a quick look at brainxcan_test/SCZ_files/MR_results/dmri_PC-FA-TBSS-1.MR_sumstats.tsv (first two rows). The descriptions of each columns are at the bottom of the table.

direction	method	nsnp	b	pval	Q	Q_df	Q_pval
IDP -> Phenotype	MR Egger	48	0.105	0.462	228	46	6.66e-26
IDP -> Phenotype	Weighted median	48	0.0324	0.258	NA	NA	NA
MR direction	MR method	Number of SNPs used in MR	Slope in MR	P-value in MR	Heterogenity test Q	Heterogenity test degree of freedom	Heterogenity test p-value

6.3 Visualization and report

We provide interactive HTMLs showing regions in the visualization:

These HTMLs can be generated along the BrainXcan visualization by setting bxcan_region_vis=True in the configuration (default: False). The output files will be put inside the folder called [prefix].vis.region_vis/.

7 Extra notes on using `snakemake`

The BrainXcan software is essentially a data pipeline containing multiple steps and intermediate files. It is implemented using snakemake Mölder et al. (2021) and the features provided by snakemake are directly applicable here.

For instance, the parameters in the configuration file can be replaced by --config, e.g. if we want to change the prefix of the output, we can do it without changing the config file:

snakemake \
  -s $REPO/Snakefile \
  --configfile $CONFIG \
  -j1 -p \
  SBrainXcan \
  --config prefix=another_test

More over, if the intermediate files are already present, snakemake will not re-generate them. In other word, if you run SBrainXcanOnly first and then run SBrainXcanAndVis, it won’t run BrainXcan association again. If you want to enforce re-runing, use the flag --forceall.

BrainXcan software is written with snakemake v6. For more cool features of snakemake, please refer to snakemake documentation.

References

Mölder, Felix, Kim Philipp Jablonski, Brice Letcher, Michael B Hall, Christopher H Tomkins-Tinch, Vanessa Sochat, Jan Forster, et al. 2021. “Sustainable Data Analysis with Snakemake.” F1000Research 10.

Ripke, Stephan, James TR Walters, Michael C O’Donovan, Schizophrenia Working Group of the Psychiatric Genomics Consortium, et al. 2020. “Mapping Genomic Loci Prioritises Genes and Implicates Synaptic Biology in Schizophrenia.” MedRxiv.

An example BrainXcan run

1 About

2 Downloading BrainXcan database

3 Downloading GWAS

4 Preparing the BrainXcan configuration file

4.1 Specifying input/output paths

4.2 Specifying GWAS meta information

4.3 Specifying parameters in BrainXcan

5 Running BrainXcan

6 Looking at the output of BrainXcan

6.1 Association output table

6.2 Mendelian randomization output table

6.3 Visualization and report

7 Extra notes on using snakemake

References

7 Extra notes on using `snakemake`