1 About

In this post, we run BrainXcan with an example.

First of all, please make sure you’ve cloned Brainxcan code repository and installed all the BrainXcan software dependencies.

To perform BrainXcan analysis, we need to prepare the following files:

  • The BrainXcan database and meta files (download from brainxcan database zenodo link)
    • To run S-BrainXcan with permutation within LD blocks, you will also need an LD block BED file, e.g. link and we have it copied here as well
  • A file containing GWAS summary statistics. It should have the following information:
    • SNP rsID (we use rsID as SNP identifier) and chromosome of the SNP.
    • Effect and non-effect allele
    • Summary statistics (one in the two groups):
      1. Group 1: effect size, standard error of effect size (zero-centered)
      2. Group 2: z-score, GWAS sample size, allele frequency
  • A pipeline configuration file.

2 Downloading BrainXcan database

The BrainXcan database and meta files can be downloaded from brainxcan database zenodo link. The directory structure is

brainxcan_data/
|-- bxcan_vis
|-- geno_covar
|-- idp_gwas
|-- idp_weights
`-- mr

There is no need to change the structure and it is designed to be used asis.

See some detailed notes on the BrainXcan database at “Looking into BrainXcan database.”

3 Downloading GWAS

In this example, we provide a pre-formatted GWAS file at here: preview link and direct download link, which is adapted from a schizophrenia GWAS by Ripke et al. (2020) (medRxiv link).

First, let’s download and take a look at it.

# download
wget https://uchicago.box.com/shared/static/s2yg1gvxm5vv1q3yijf5h389o3hkadwv.gz \
  -O formatted.SCZ_PGC_2020.tsv.gz
  
# look at the first few rows
zcat < formatted.SCZ_PGC_2020.tsv.gz | head

# returns
variant_id  chromosome  non_effect_allele   effect_allele   effect_size standard_error  zscore  sample_size frequency
rs3131972   chr1    G   A   -0.0020020026706730793  0.0111  -0.18439957 71533   0.1652
rs3131969   chr1    G   A   -0.006601743634562364   0.0118  -0.56070304 71533   0.1339
rs3131967   chr1    C   T   -0.005796768847177342   0.0119  -0.48651794 71533   0.1171
rs1048488   chr1    T   C   0.002696361547742533    0.0117  0.2303755   71533   0.1667
rs12562034  chr1    G   A   -0.003902375817241093   0.0156  -0.24843223 66899   0.1027
rs12124819  chr1    A   G   -0.011800104115750618   0.0181  -0.64967835 58941   0.2243
rs4040617   chr1    A   G   -0.0010993954433010642  0.0127  -0.09036144 71843   0.1295
rs4970383   chr1    C   A   0.02810116511090993 0.0111  2.5433347   64548   0.1964
rs4475691   chr1    C   T   0.012102946067745635    0.0121  0.9998151   65154   0.1429

4 Preparing the BrainXcan configuration file

TLDR: The configuration file is in YAML format and the one for example run can be downloaded from here. And there are a few fields that you need to fill in.

If you are not in a hurry, we recommend going through the following sections which explain the configuration file in details.

The configuration file contains three main parts:

  • Input/output file paths.
  • Meta information about the GWAS file.
  • Some parameter settings in BrainXcan.

We will go over them one by one.

4.1 Specifying input/output paths

  • datadir: Path to the BrainXcan database folder.
  • outdir: Output directory name (if it does not exist, it will be generated).
  • prefix: Prefix of the output files (to distinguish between results in the same output folder).
  • brainxcan_path: Path to the BrainXcan code repository.

In the configuration YAML, we have:

datadir: 'path-to-braixcan-database/brainxcan_data'
outdir: 'brainxcan_test'
prefix: 'SCZ_PGC_2020'
brainxcan_path: 'path-to-brainxcan-repository/brainxcan'

4.2 Specifying GWAS meta information

Here we want to tell the software how to understand the GWAS file.

  • gwas: Path to GWAS file (formats: csv(.gz), tsv(.gz), parquet).
  • snpid: Column name of GWAS SNP rsID
  • effect_allele: Column name of GWAS effect allele
  • non_effect_allele: Column name of GWAS non-effect allele
  • chr: Column name of chromosome
  • position: Column name of position

Along with Group 1 or 2 (see details at here):

  • Group 1:
    • effect_size: Column name of GWAS effect size
    • effect_size_se: Column name of GWAS effect size standard error
  • Group 2:
    • zscore: Column name of GWAS z-score
    • sample_size: Column name of GWAS sample size
    • allele_frequency: Column name of allele frequency

When both groups are specified, group 1 has high priority. For the example run, we go with group 1.

In the configuration YAML, we have:

gwas: 'formatted.SCZ_PGC_2020.tsv.gz'
snpid: 'variant_id'
effect_allele: 'effect_allele'
non_effect_allele: 'non_effect_allele'
chr: 'chromosome'
effect_size: 'effect_size'
effect_size_se: 'standard_error'
position: 'position'

4.3 Specifying parameters in BrainXcan

There are several parameters underlying a BrainXcan run. They are associated with three aspects:

  • BrainXcan association test
    • model_type: IDP prediction model type: ridge or elastic_net (Default = ridge)
    • idp_type: IDP sets to use: original or residual (after PC adjustment) (Default = residual)
    • spearman_cutoff: CV Spearman cutoff on models (only models passing this criteria will be shown) (Default = 0.1)
  • BrainXcan adjustment options
    • (Recommended) Adjusting z-scores with LD block-based permutation
      • bxcan_ldblock_perm: If want to run this adjustment, set the path to the BED file (TAB-delimited base0 with header chr, start, stop)
      • bxcan_ldblock_perm_seed: Random seed
      • bxcan_ldblock_perm_nrepeat: Number of permutation across all IDPs
      • correction_factor_perm: Correcting the standard error of the simulated z-scores by multiplying with this factor (default = 1.1)
    • (Not recommended) Adjusting z-scores via simulated weights
      • bxcan_empirical_z: Whether to run this adjustment or not
      • bxcan_empirical_z_seed: Random seed
      • bxcan_empirical_z_nrepeat: Number of simulated IDPs
      • correction_factor_emp: Correcting the standard error of the simulated z-scores by multiplying with this factor
  • Mendelian randomization (MR): extrating a list of significant IDP and run MR
    • signif_pval: P-value cutoff to call an IDP significant (Default = 1e-5)
    • signif_max_idps: Maximun number of IDPs to run MR (discard by p-value if too many IDPs passing p-value filter) (Default = 10)
    • ld_clump_yaml: Parameters of LD clumping for defining the instrument SNPs in MR (Default = {datadir}/mr/ld_clump.yaml)\(\dagger\)
  • Dependent executables
    • rscript_exe: Path to the Rscript executable (Default = Rscript)
    • python_exe: Path to the Python executable (Default = python)
    • plink_exe: Path to the plink v1.9 executable (Default = plink)

\(\dagger\) {datadir} will be interpreted as path-to-braixcan-database/brainxcan_data internally since we set datadir: path-to-braixcan-database/brainxcan_data.

In the configuration YAML, we have:

model_type: 'ridge'
idp_type: 'residual'
spearman_cutoff: 0.1 

signif_pval: 0.01
signif_max_idps: 20
gwas_pop: 'EUR'
ld_clump_yaml: '{datadir}/mr/ld_clump.yaml'

rscript_exe: 'Rscript' 
python_exe: 'python'
plink_exe: 'plink'

The full list of parameters available in a BrainXcan configuration YAML is explained (with default values) at here

5 Running BrainXcan

Now we are ready to run BrainXcan pipeline. The general command is

# conda activate brainxcan 

REPO=path-to-brainxcan-repository
CONFIG=path-to-config-file

snakemake \
  -s $REPO/Snakefile \
  --configfile $CONFIG \
  [rule-name]

[rule-name] can take the following four options:

rule.name Association Visualization Mendelian.randomization Automated.report
SBrainXcanOnly x
SBrainXcanAndVis x x
SBrainXcanAndMR x x
SBrainXcan x x x x

Let’s say we want to run through everything, i.e. replacing [rule-name] with SBrainXcan. So, we can do:

snakemake \
  -s $REPO/Snakefile \
  --configfile $CONFIG \
  -n -p \
  SBrainXcan

We use -n options in snakemake to perform a dry-run to see what will be run (-p is to print more information about each step). The MR runs won’t be printed since we’ve not known which IDPs will be significant but these will be updated automatically during the computation.

If everything looks good, we could start running the command (-j1 tells snakemake to use one core).

snakemake \
  -s $REPO/Snakefile \
  --configfile $CONFIG \
  -j1 -p \
  SBrainXcan

It takes about 9 minutes with 4 Gb memory. The output files are put in folder brainxcan_test/ with prefix SCZ_PGC_2020.

6 Looking at the output of BrainXcan

The output files of the above run can be downloaded from here. Here is an overview of the files.

brainxcan_test
├── SCZ_PGC_2020.report.html
├── SCZ_PGC_2020.sbrainxcan.csv
├── SCZ_PGC_2020.vis.T1.pdf
├── SCZ_PGC_2020.vis.dMRI.pdf
└── SCZ_PGC_2020_files
    ├── MR_results
    │   ├── dmri_IDP-25687.MR.rds
    │   ├── dmri_IDP-25687.MR_sumstats.tsv
    │   ├── dmri_IDP-25687.MR_vis.pdf
    │   ├── t1_IDP-25020.MR.rds
    │   ├── t1_IDP-25020.MR_sumstats.tsv
    │   ├── t1_IDP-25020.MR_vis.pdf
    |   └── [more MR results]
    └── logs
        ├── dmri_IDP-25687.mr.log
        ├── dmri_IDP-25687.mr_vis.log
        ├── sbrainxcan_dmri.log
        ├── sbrainxcan_t1.log
        ├── sbxcan_merge.log
        ├── sbxcan_report.log
        ├── sbxcan_signif.log
        ├── sbxcan_vis.log
        ├── t1_IDP-25020.mr.log
        ├── t1_IDP-25020.mr_vis.log
        └── [more MR log files]
  • [prefix].sbrainxcan.csv: BrainXcan association results in CSV format.
  • [prefix].vis.T1.pdf: Visualization of associations of structural IDPs.
  • [prefix].vis.dMRI.pdf: Visualization of associations of diffusion IDPs (only TBSS IDPs).
  • [prefix]_files/MR_results: Mendelian randomization for each significant IDPs
    • *.MR_vis.pdf: Plotting effect sizes of instrument SNPs in MR.
    • *.MR_sumstats.tsv: Results of MR tests summarized in a table (TSV format).
    • *.MR.rds: Raw data used for MR test.
  • [prefix].report.html: The automated HTML report summarizing the association and MR results (for diffusion IDPs, only TBSS ones are shown).
  • [prefix]_files/logs/: Log files of each analysis module.

6.1 Association output table

Let’s take a quick look at brainxcan_test/SCZ_PGC_2020.sbrainxcan.csv (first two rows). The descriptions of each columns are at the bottom of the table.

Table continues below
IDP bhat pval z_brainxcan nsnp_used nsnp_total CV_R2
IDP-25020 0.395 5.79e-17 8.37 1069786 1071343 0.0129
IDP-25687 0.353 2.1e-14 7.64 1069786 1071343 0.0128
IDP ID code (UKB field ID) Estimated effect size of the association test P-value of the association test (without adjustment) Z-score of the association test (without adjustment) Number of SNPs used in the calculation Total number of SNPs in the prediction model R2 of the prediction model (cross-validated)
Table continues below
CV_Pearson CV_Spearman modality z_adj_gc pval_adj_gc z_adj_perm_null
0.114 0.109 T1 3.13 0.00176 4.45
0.113 0.11 dMRI 2.86 0.00427 4.07
Pearson correlation of the prediction model (cross-validated) Spearman correlation of the prediction model (cross-validated) T1 (structural) or dMRI (diffusion) Z-score adjusted by GC P-value adjust by GC Z-score adjusted by permutation
pval_adj_perm_null subtype left_or_right region notes
8.53e-06 Subcortical right hippocampus Volume of right hippocampus (from T1 brain image)
4.79e-05 w-OD NA forceps major Weighted-mean OD (orientation dispersion index) in tract forceps major (from dMRI data)
P-value adjusted by permutation IDP subtype Left/right side of the brain Anatomical region Notes of the IDP from UKB

6.2 Mendelian randomization output table

Let’s take a quick look at brainxcan_test/SCZ_PGC_2020_files/MR_results/dmri_IDP-25687.MR_sumstats.tsv (first two rows). The descriptions of each columns are at the bottom of the table.

direction method nsnp b pval Q Q_df Q_pval
IDP -> Phenotype MR Egger 33 -0.591 0.0595 170 31 3.98e-21
IDP -> Phenotype Weighted median 33 0.0394 0.476 NA NA NA
MR direction MR method Number of SNPs used in MR Slope in MR P-value in MR Heterogenity test Q Heterogenity test degree of freedom Heterogenity test p-value

6.3 Visualization and report

We provide interactive HTMLs showing regions in the visualization:

These HTMLs can be generated along the BrainXcan visualization by setting bxcan_region_vis=True in the configuration (default: False). The output files will be put inside the folder called [prefix].vis.region_vis/.

7 Extra notes on using snakemake

The BrainXcan software is essentially a data pipeline containing multiple steps and intermediate files. It is implemented using snakemake Mölder et al. (2021) and the features provided by snakemake are directly applicable here.

For instance, the parameters in the configuration file can be replaced by --config, e.g. if we want to change the prefix of the output, we can do it without changing the config file:

snakemake \
  -s $REPO/Snakefile \
  --configfile $CONFIG \
  -j1 -p \
  SBrainXcan \
  --config prefix=another_test

More over, if the intermediate files are already present, snakemake will not re-generate them. In other word, if you run SBrainXcanOnly first and then run SBrainXcanAndVis, it won’t run BrainXcan association again. If you want to enforce re-runing, use the flag --forceall.

BrainXcan software is written with snakemake v6. For more cool features of snakemake, please refer to snakemake documentation.

References

Mölder, Felix, Kim Philipp Jablonski, Brice Letcher, Michael B Hall, Christopher H Tomkins-Tinch, Vanessa Sochat, Jan Forster, et al. 2021. “Sustainable Data Analysis with Snakemake.” F1000Research 10.
Ripke, Stephan, James TR Walters, Michael C O’Donovan, Schizophrenia Working Group of the Psychiatric Genomics Consortium, and others. 2020. “Mapping Genomic Loci Prioritises Genes and Implicates Synaptic Biology in Schizophrenia.” MedRxiv.