How to run

 
Guidance execution

The details on how to install and run GUIDANCE, as well as the download links, are available in this GitLab repository:

https://gitlab.bsc.es/computational-genomics/guidance


The steps to execute GUIDANCE are also described below:


1. Configuration file description

1.1 - Input description

1.2 - Configuration file examples

1.3 - Environment file example

2. GUIDANCE execution

2.1 - Running on a Singularity image

2.2 - Running on bare metal

3. Cloud Utils

3.1 - Commands

3.2 - Executing GUIDANCE

3.3 - Offline execution

3.4 - Contributing

3.5 - Disclaimer

3.6 - License

 
1. Configuration file description

In order to run GUIDANCE, the user has to edit a configuration file where the basic input and output characteristics are specified. This file also allows the tuning of multiple parameters related to, among others, covariates, chunk size for genotype imputation, info scores, and minor allele frequency and Hardy-Weinberg thresholds. The user can also decide to run several phenotypes and several phenotype/covariate combinations in a single run.

A detailed description of each configurable parameter is provided in configuration_file_parameters.pdf and in the list below.


  • wfDeep: Name that defines the number of stages to be executed. These stages are defined in Figures 1 and 2.
  • init_chromosome: First chromosome to analyse.
  • end_chromosome: Last chromosome to analyse.
  • maf_threshold: Minor allele frequency cut-off used to filter final results.
  • impute_threshold: IMPUTE2 info score cut-off used to filter final results.
  • minimac_threshold: MINIMAC Estimated imputation accuracy (R²) cut-off used to filter final results.
  • hwe_cohort_threshold: Hardy-Weinberg equilibrium p.value threshold for cohort.
  • hwe_cases_threshold: Hardy-Weinberg equilibrium p.value threshold for cases.
  • hwe_controls_threshold: Hardy-Weinberg equilibrium p.value threshold for controls.
  • exclude_cgat_snps: Logical. Whether or not G>C or A>T SNPs should be excluded. We strongly recommend activating this flag to avoid strand orientation issues. Most genotyping arrays have a very small number of such SNPs, and their exclusion should not result in any noticeable loss of imputation performance.
  • imputation_tool: The name of the imputation tool to impute genotypes. To date, only "impute" to select IMPUTE2 and "minimac" to select MINIMAC4 are accepted.
  • test_types: Names of the different analyses to be carried out by GUIDANCE, separated by commas. The association results for each "test_type" will be created in a directory with the same name inside the "associations" directory. Below this flag, each "test_type" has to be listed with the phenotype name and the covariate names to take into account in the association analysis (for instance, to analyse "test_types = DIA2,CARD" users should add "DIA2 = DIA2:sex,BMI" and "CARD = CARD:sex,BMI" below, where sex and BMI are covariates).
  • chunk_size_analysis: Size of the chunks considered to partition the data.
  • file_name_for_list_of_stages: File into which all the commands launched in the workflow are stored.
  • input_format: Format of the input genotype files. Currently, only BED (PLINK binary) input has been tested and is supported.
  • mixed_cohort: Name of the cohort.
  • mixed_bed_file_dir: The path to the directory with genotype files.
  • mixed_bed/bim/fam_file: Names of the PLINK .bed/.bim/.fam files containing the genotypes.
  • mixed_sample_file_dir: Path to the directory where the sample file is located.
  • mixed_sample_file: Name of the sample file.
  • genmap_file_dir: Path where genetic map files are located.
  • genmap_file_chr_n: Name of the genetic map file for each chromosome in every new line.
  • refpanel_number: Number of reference panels.
  • refpanel_combine: 'NO' if there is only one panel or imputed results from different reference panels should not be integrated; 'YES' when different reference panels are expected to be used in the analysis and also the integration of all the results is required.
  • refpanel_type: Name of the reference panel.
  • refpanel_memory: Amount of memory required by each particular panel. Currently, "HIGH", "MEDIUM" and "LOW" are supported.
  • refpanel_file_dir: Path where the reference panel for each chromosome is located.
  • refpanel_hap_file_chr_n: Haplotype files per chromosome of the reference panel, provided when IMPUTE2 is chosen as the imputation tool, and for chromosome X when Minimac4 is used.
  • refpanel_leg_file_chr_n: Legend files per chromosome of the reference panel, provided when IMPUTE2 is chosen as the imputation tool, and for chromosome X when Minimac4 is used.
  • refpanel_vcf_file_chr_n: VCF files per chromosome of the reference panel provided in case Minimac4 is used.
  • outputdir: The path of the directory where the results will be written.

 
1.1 Input description

GUIDANCE accepts PLINK format files (e.g. gwas.bed, gwas.bim, gwas.fam) as input. It also accepts covariates, such as principal components, sex or any other covariate defined in the sample file. As part of the input, GUIDANCE also requires one or several reference panels for imputation, accepting public (1KG phase 1 or 3, HapMap, DCEG, UK10K, GoNL) or private reference panels. A genetic map is also needed.
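
As an optional, illustrative sanity check of the PLINK fileset before launching GUIDANCE (a sketch only, assuming PLINK 1.9 is on the PATH and that the fileset prefix is gwas):

# Report allele frequencies and missingness for the input fileset
plink --bfile gwas --freq --out gwas_freq
plink --bfile gwas --missing --out gwas_missing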

Sample File

The user also needs to provide an adequate sample file. Please check the SNPTESTv2 webpage for information on how to prepare a suitable sample file.
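
As an illustration only, a minimal sample file could look as follows, assuming a binary phenotype named DIA2 with sex and BMI as covariates (identifiers and values are placeholders; the SNPTESTv2 documentation remains the authoritative reference for the column type codes):

ID_1 ID_2 missing sex BMI DIA2
0 0 0 D C B
sample_001 sample_001 0 1 27.4 0
sample_002 sample_002 0 2 31.2 1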

Genetic Map File

All IMPUTE2 reference panel download packages come with appropriate recombination map files; check the IMPUTE2 webpage for more information. These genetic map files can also be used for phasing with SHAPEIT. One file per chromosome must be given.

On the other hand, when phasing with Eagle, a single file must be given. Several compatible options are available on the Broad Institute website.
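
For reference, the Eagle distribution also ships with compatible genetic maps in its tables directory; assuming a GRCh37/hg19 build, one commonly used file is:

genetic_map_hg19_withX.txt.gz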

Reference panels

It must be noted that, when using IMPUTE2, the reference panels must be provided in .haps and .legend format. Minimac4, on the other hand, only accepts M3VCF; regular VCF files can be converted to M3VCF using Minimac3.
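
As a hedged example, a per-chromosome VCF reference panel could be converted with Minimac3 along these lines (file names are placeholders; the resulting .m3vcf.gz file would then be referenced through the refpanel_vcf_file_chr_n parameters described above):

Minimac3 --refHaps refpanel_chr21.vcf.gz --processReference --prefix refpanel_chr21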

In addition, we have encountered some problems when imputing chromosome X with Minimac4. For this reason, the imputation of this chromosome is always performed with IMPUTE2, so even when using Minimac4, the .haps and .legend files must be provided for chromosome X if this chromosome is included in the study.

 
1.2 Configuration file examples

We provide template configuration files for a study spanning chromosome 21 to chromosome X, using HRC, 1000 Genomes phase 3, UK10K and GoNL as reference panels, and Eagle-IMPUTE2, SHAPEIT-IMPUTE2, Eagle-Minimac4 and SHAPEIT-Minimac4 as phasing and imputation tool combinations.
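
For orientation, the skeleton below is a minimal, hypothetical excerpt of such a configuration file. It assumes a simple parameter = value syntax (as suggested by the test_types example in Section 1), a single reference panel and imputation with IMPUTE2; the wfDeep value is a placeholder stage name, and all paths, names and thresholds are illustrative. The exact keys and syntax should be checked against the templates distributed with the repository.

wfDeep = whole_workflow
init_chromosome = 21
end_chromosome = 23
maf_threshold = 0.001
impute_threshold = 0.7
hwe_cohort_threshold = 1e-6
exclude_cgat_snps = YES
imputation_tool = impute
test_types = DIA2
DIA2 = DIA2:sex,BMI
chunk_size_analysis = 1000000
input_format = BED
mixed_cohort = EXAMPLE_COHORT
mixed_bed_file_dir = /path/to/genotypes/
mixed_bed_file = gwas.bed
mixed_bim_file = gwas.bim
mixed_fam_file = gwas.fam
mixed_sample_file_dir = /path/to/sample/
mixed_sample_file = gwas.sample
genmap_file_dir = /path/to/genetic_maps/
genmap_file_chr_21 = genetic_map_chr21_combined_b37.txt
refpanel_number = 1
refpanel_combine = NO
refpanel_type = 1kphase3
refpanel_memory = MEDIUM
refpanel_file_dir = /path/to/1kphase3/
refpanel_hap_file_chr_21 = 1kphase3_chr21.hap.gz
refpanel_leg_file_chr_21 = 1kphase3_chr21.legend.gz
outputdir = /path/to/outputs_directory/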
 
1.3 Environment file example

The constraints presented here have been optimized to run efficiently on our test dataset example, which consists of the whole-genome imputation of 4,672 samples (cases and controls) with the 1000 Genomes phase 1 and UK10K reference panels.

#!/bin/bash

### PHASE 1 ###

export phasingMem="50.0"
export phasingCU="48"
export phasingBedMem="50.0"
export phasingBedCU="48"

### PHASE 2 ###

export qctoolMem="16.0"
export qctoolSMem="1.0"
export gtoolsMem="6.0"
export samtoolsBgzipMem="6.0"
export imputeWithImputeLowMem="8.0"
export imputeWithImputeMediumMem="12.0"
export imputeWithImputeHighMem="20.0"
export imputeWithMinimacLowMem="4.0"
export imputeWithMinimacMediumMem="8.0"
export imputeWithMinimacHighMem="32.0"
export filterByInfoImputeMem="12.0"
export filterByInfoMinimacMem="24.0"

### PHASE 3 ###

export createListOfExcludedSnpsMem="1.0"
export filterHaplotypesMem="1.0"
export filterByAllMem="1.0"
export jointFilteredByAllFilesMem="15.0"
export jointCondensedFilesMem="1.0"
export generateTopHitsAllMem="2.0"
export generateTopHitsMem="2.0"
export filterByMafMem="2.0"
export snptestMem="2.0"
export mergeTwoChunksMem="1.0"
export mergeTwoChunksInTheFirstMem="1.0"
export combinePanelsMem="1.0"
export combineCondensedFilesMem="1.0"
export combinePanelsComplex1Mem="1.0"


### PHASE 4 ###

export generateCondensedTopHitsCU="48"
export generateCondensedTopHitsMem="90.0"
export generateQQManhattanPlotsCU="24"
export generateQQManhattanPlotsMem="45.0"
export phenoMergeMem="80.0"


These constraints correspond to all the phases executed during a run. Most of them can be left as shown here; nevertheless, some of them should be tuned depending on the execution (an illustrative override is sketched after this list):
  • phasingMem: when setting this parameter, it should be taken into account that only one task per chromosome will be created. Hence, it should be set in such a way that all chromosomes can start being phased from the beginning while, at the same time, each task holds as many resources as possible.

  • imputeWithImputeX: this is the amount of memory used by IMPUTE2 when imputing the different chunks. This parameter depends on the size of the reference panel used as well as the size of the input: the larger the cohort, the more memory is needed.

  • imputeWithMinimacX: this is the amount of memory used by Minimac when imputing the different chunks. This parameter depends on the size of the reference panel used as well as the size of the input: the larger the cohort, the more memory is needed.

  • generateX: this corresponds to the generation of the final files. As in the first step, it should be set as high as possible, as long as all the pending tasks can still run at once.
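
As an illustrative example only, a smaller cohort running on nodes with less memory might override a few of the variables above; the values below are arbitrary placeholders rather than recommendations.

### Hypothetical overrides for a smaller run; adapt in set_environment.sh ###
export phasingMem="25.0"
export phasingCU="24"
export imputeWithImputeHighMem="16.0"
export imputeWithMinimacHighMem="16.0"
export generateCondensedTopHitsMem="45.0"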
 
2. GUIDANCE execution
 
2.1 Running on a Singularity image

A fairly complete example of a launch script using Singularity is shown next:

#!/bin/bash -e

export COMPSS_PYTHON_VERSION="2"

module purge
module load intel/2018.1
module load singularity/2.4.2

base_dir=$(pwd)
work_dir=${base_dir}/logs/
# Shared working directory for the workers (alternative, overridden by the line below):
# worker_work_dir=${base_dir}/tmpForCOMPSs/
worker_work_dir=scratch

source $base_dir/set_environment.sh

exec_time=2880
num_nodes=50
tracing=true
graph=true
debug=off
cpus_per_node=48
worker_in_master_cpus=0
worker_in_master_memory=80000
qos=bsc_cs

mkdir -p ${base_dir}/outputs_directory

/path/to/COMPSs/Runtime/scripts/user/enqueue_compss \
--qos=${qos} \
--job_dependency=5804941 \
--graph=${graph} \
--tracing=${tracing} \
--log_level=${debug} \
--exec_time=${exec_time} \
--num_nodes=${num_nodes} \
--base_log_dir=${base_dir} \
--worker_in_master_cpus=${worker_in_master_cpus} \
--worker_in_master_memory=${worker_in_master_memory} \
--cpus_per_node=${cpus_per_node} \
--master_working_dir=${work_dir} \
--worker_working_dir=${worker_work_dir} \
--scheduler="es.bsc.compss.scheduler.fifodatanew.FIFODataScheduler" \
--classpath=/path/to/guidance.jar \
--jvm_workers_opts="-Dcompss.worker.removeWD=true" \
--container_image=/path/to/guidance_singularity.img \
guidance.Guidance -config_file ${base_dir}/config_GERA_5000_shapeit_impute_1_23_cloud.file

In the next list, the most important features are explained in the same order as they appear in the script:
  • Modules needed to run the execution.
  • work_dir: the folder where the log files are stored.
  • worker_work_dir: the folder where all the temporary files will be stored. If tmp, each worker node will use its /tmp. Otherwise, a shared directory between all the nodes should be specified (in general, this means pointing to an NFS, GPFS or Lustre directory).
  • File with all the environment variables pointing out the memory constraints necessary to run the execution.
  • General constraints for the queue system (COMPSs will correctly translate them to whichever queue system is installed in the cluster).
  • Creation of the folder where the output files will be placed; it should be equal to the one stated in the configuration file.
Afterwards, in the launch command, there are 3 important files:
  • guidance_25_09_03_20_0_1_1.jar: GUIDANCE binary.
  • guidance_singularity.img: generated singularity image.
  • config_GERA_5000_shapeit_impute_1_23_cloud.file: configuration file.
It is important to keep in mind that the output directory created should be equal to the one specified in the configuration file.
 
2.2 Running on bare metal

#!/bin/bash -e

module load COMPSs
module load mkl
module load intel/2017.4
module load samtools/1.5
module load R/3.5.1
module load bcftools/1.8
module load gcc/5.4.0

base_dir=$(pwd)
work_dir=${base_dir}/logs/
worker_work_dir=${base_dir}/tmpForCOMPSs/

export BCFTOOLSBINARY=/path/to/BCFTOOLS/1.8/INTEL/bin/bcftools
export RSCRIPTBINDIR=/path/to/R/3.5.1/INTEL/bin/
export SAMTOOLSBINARY=/path/to/SAMTOOLS/1.5-DNANEXUS/INTEL/IMPI/bin

export PLINKBINARY=/path/to/TOOLS/apps_gwimp_compss/plink_1.9/plink
export EAGLEBINARY=/path/to/TOOLS/Eagle_v2.4.1/eagle
export RSCRIPTDIR=/path/to/R_SCRIPTS/
export QCTOOLBINARY=/path/to/TOOLS/qctool_v1.4-linux-x86_64/qctool
export SHAPEITBINARY=/path/to/TOOLS/shapeit.v2.r727.linux.x64
export IMPUTE2BINARY=/path/to/TOOLS/impute_v2.3.2_x86_64_static/impute2
export SNPTESTBINARY=/path/to/TOOLS/snptest_v2.5
export MINIMAC3BINARY=/path/to/TOOLS/Minimac3/bin/Minimac3
export MINIMAC4BINARY=/path/to/TOOLS/Minimac4/release-build/minimac4
export TABIXBINARY=/path/to/SAMTOOLS/1.5-DNANEXUS/INTEL/IMPI/bin/tabix
export BGZIPBINARY=/path/to/SAMTOOLS/1.5-DNANEXUS/INTEL/IMPI/bin/bgzip

export R_LIBS_USER=/path/to/TOOLS/R_libs/

export LC_ALL="C"

source $base_dir/set_environment.sh

exec_time=700
num_nodes=25
tracing=true
graph=true
log_level=off
cpus_per_node=48
worker_in_master_cpus=0
worker_in_master_memory=80000
qos=bsc_cs

mkdir -p ${base_dir}/outputs_shapeit_impute_1909_erase_all

enqueue_compss \
--qos=${qos} \
--job_dependency=7403259 \
--graph=${graph} \
--tracing=${tracing} \
--log_level=${log_level} \
--exec_time=${exec_time} \
--num_nodes=${num_nodes} \
--base_log_dir=${base_dir} \
--worker_in_master_cpus=${worker_in_master_cpus} \
--worker_in_master_memory=${worker_in_master_memory} \
--cpus_per_node=${cpus_per_node} \
--master_working_dir=${work_dir} \
--worker_working_dir=${worker_work_dir} \
--scheduler="es.bsc.compss.scheduler.fifodatanew.FIFODataScheduler" \
--classpath=${base_dir}/guidance_25_1909_erase.jar \
--jvm_workers_opts="-Dcompss.worker.removeWD=true" \
guidance.Guidance -config_file ${base_dir}/config_GERA_300_shapeit_impute_1909_erase_all.file


In the next list, the most important features are explained in the same order as they appear in the script:
  • Modules needed to run COMPSs.
  • work_dir: the folder where the log files are stored.
  • worker_work_dir: the folder where all the temporary files will be stored. If tmp, each worker node will use its /tmp. Otherwise, a shared directory between all the nodes should be specified (in general, this means pointing to an NFS, GPFS or Lustre directory).
  • Environment variables pointing to where all the needed binaries are placed.
  • File with all the environment variables pointing out the memory constraints necessary to run the execution.
  • General constraints for the queue system (COMPSs will correctly translate them to whichever queue system is installed in the cluster).
  • Creation of the folder where the output files will be placed; it should be equal to the one stated in the configuration file.
Afterwards, in the launch command, there are 2 important files:
  • guidance_25_1909_erase.jar: GUIDANCE binary.
  • config_GERA_300_shapeit_impute_1909_erase_all.file: configuration file.
It is important to keep in mind that the output directory created should be equal to the one specified in the configuration file.
 
3. Cloud Utils

Utils to configure Guidance and COMPSs in a cloud environment.

The details on how to configure them are available in this GitLab repository:
https://gitlab.bsc.es/computational-genomics/guidance_cloud
 
3.1 Commands

These commands take into account the information presented in a configuration file.

./create_snapshot.sh -h
./create_cluster.sh -h

 
Create snapshot

Before launching any execution, the snapshots that will serve as a base to create the cluster master and workers need to be created. The most important information supplied in this step is the available amount of disk space in both the master and worker nodes. Once these variables have been correctly set, launching the command as follows will create both snapshots:

./create_snapshot.sh --props=production.props

It is possible to store as many property files as wanted. They must be placed in the props folder.
 
Create cluster

Once the snapshots have been created (the same snapshots can serve as a base for several runs), a cluster with the requested number of nodes can be created in order to launch a COMPSs execution.

./create_cluster.sh --props=production.props

 
Configuration file

It is possible to store as many property files as needed; they must be placed in the props folder. A description of each property, along with the configuration used for the largest execution performed so far, is available in the repository.
 
3.2 Executing GUIDANCE

Once the cluster has been created, the following actions must be performed (a minimal command sketch follows the list):
  • Copy all the necessary files to the cluster
  • Copy the configuration file to the cluster
  • SSH into the master machine and execute the file launch.sh and wait until the execution finishes
  • Copy all the files that need to be stored into the bucket or any other persistent disk
  • Destroy the cluster through the Google Cloud's console
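
A minimal sketch of these steps from the local machine could look as follows, assuming password-less SSH access to the master node and a Google Cloud Storage bucket for persistence (the user name, MASTER_IP, paths, file names and bucket are placeholders):

# Copy the inputs and the configuration file to the master node
scp -r inputs/ config_example.file user@${MASTER_IP}:/home/user/
# Launch GUIDANCE and wait until the execution finishes
ssh user@${MASTER_IP} "/home/user/launch.sh"
# Persist the results to a bucket before destroying the cluster
ssh user@${MASTER_IP} "gsutil -m cp -r /home/user/outputs_directory gs://your-bucket/results/"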
 
3.3 Offline execution

With the previous instructions, GUIDANCE is launched directly from a console that must remain open during the whole execution. Nevertheless, in order to be able to shut down the local machine, the following command can be used instead:

ssh user@{master_ip} "/home/user/launch.sh > /home/user/output.txt 2> error.txt &"&

This way, the output is stored in the supplied file instead of being printed to the console, which makes it possible to shut down the local machine while still being able to check the progress of the execution.
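
For instance, the progress can be checked at any moment from the local machine (same placeholders as in the command above):

ssh user@{master_ip} "tail -n 50 /home/user/output.txt"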
 
3.4 Contributing

All kinds of contributions are welcome. Please do not hesitate to open a new issue, submit a pull request or contact the author if necessary.
 
3.5 Disclaimer

This is part of a collaboration between the Computational Genomics group and the Workflows and Distributed Computing team at the BSC, and it is still under development.
 
3.6 License

Licensed under the Apache 2.0 License