SMuFin - Manual

Introduction

SMuFin is a method designed at the Barcelona Supercomputing Center for detecting somatic variation in tumor genomes by directly comparing the reads of a tumor genome against those of a normal genome from the same individual. SMuFin does not require a prior alignment of the tumor and normal reads to a reference genome. Instead, it works directly on the reads from FASTQ or BAM files to identify somatic variants, which are then mapped onto the provided reference.

Please note that, because SMuFin identifies all detectable differences between two sequence samples, it can potentially be used for a broad range of purposes. However, the current version (as described here) has been calibrated and tuned to identify somatic variants in the context of cancer genomics, using tumor and normal genomes from the same individual, so we cannot guarantee the reported sensitivity and specificity ranges outside this context.

How to Install

In order to build SMuFin from the source code, the user must have the following tools and libraries installed:

C/C++ Compiler (GCC >= 4.3.X)

MPI Environment (OpenMPI >= 1.5.X)

To build SMuFin, run the following commands:

   $> tar -zxvf smufin_0.9.3_mpi_beta.tar.gz

   $> cd smufin_0.9.3_mpi_beta

   $> make

System Requirements

SMuFin MPI distributes the genomic load among the nodes of a cluster, so the minimum number of computing nodes must be determined first. For example, a human genome dataset at 30x sequencing coverage typically requires around 250 GB of RAM, distributed among all defined MPI instances. An approximate estimate of the number of nodes required for each sample type can be obtained using the following formulas:

MT = Mt + Mn

M(t or n) = L * C * 2.3

N = MT / R

where:

	Mt = Memory necessary for the tumor genome

	Mn = Memory necessary for the normal genome

	MT = The total approximate amount of memory required 
	     for a complete analysis of a tumor-normal pair

	L = Genome length in Megabase pairs covered by the sequencing

	C = Sequencing coverage

	M(T, t or n) = Required RAM in MBytes

	R = Memory available per user in each node (in MBytes)

	N = Minimum number of computing nodes required

For example, a complete analysis of a tumor-normal pair with an approximate L of 3000, each sequenced at C = 30x, would require, on a cluster with 32 GB of RAM per node, a minimum of N = (3000 * 30 * 2.3) * 2 / 32000 ~ 13 computing nodes.

Beyond these calculations, we recommend using more nodes than the strict minimum estimate.
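
The formulas above can be sketched as a small Python helper. This is only an illustration of the arithmetic in this manual, not part of SMuFin; the function names are hypothetical, and the result is rounded up because a fraction of a node cannot be allocated.

```python
import math

# Memory factor taken from the formula M = L * C * 2.3 in this manual.
MEMORY_FACTOR = 2.3


def sample_memory_mb(genome_length_mbp, coverage):
    """Approximate RAM (MBytes) needed for one sample: M = L * C * 2.3."""
    return genome_length_mbp * coverage * MEMORY_FACTOR


def min_nodes(genome_length_mbp, tumor_coverage, normal_coverage, ram_per_node_mb):
    """Return (MT, N): total memory in MBytes and the minimum node count."""
    total_mb = (sample_memory_mb(genome_length_mbp, tumor_coverage)
                + sample_memory_mb(genome_length_mbp, normal_coverage))
    # N = MT / R, rounded up to a whole number of nodes.
    return total_mb, math.ceil(total_mb / ram_per_node_mb)


# Worked example from the manual: L = 3000, C = 30x each, 32 GB per node.
total_mb, nodes = min_nodes(3000, 30, 30, 32000)
print(nodes)  # 13
```

As recommended above, the returned node count should be treated as a lower bound rather than a target.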

How to Run SMuFin

SMuFin can be run using FASTQ or BAM files.

To run SMuFin, the paths to the tumor and normal samples, as well as to the reference genome, must be specified. The reference genome must be in FASTA format and the normal and tumor reads in FASTQ format. Alternatively, BAM files can be used as the source of the reads. Because raw sequences are extracted from the BAM files, optimal results are obtained with BAM files that contain both aligned and unaligned reads.

Usage: SMuFin < command > [ options ]

Commands:

   --ref		<FILE>	Reference genome in FASTA format

   --normal_bam		<FILE>	Use Normal BAM file instead of FASTQ

   --tumor_bam		<FILE>	Use Tumor BAM file instead of FASTQ

   --normal_fastq_1*	<FILE>	Normal FASTQ 1st Paired-End file

   --normal_fastq_2*	<FILE>	Normal FASTQ 2nd Paired-End file

   --tumor_fastq_1*	<FILE>	Tumor FASTQ 1st Paired-End file

   --tumor_fastq_2*	<FILE>	Tumor FASTQ 2nd Paired-End file

   --tumor_cont_perc	<NUM>	Expected percentage of tumor contamination in the 
				normal dataset (0-100)

   --min_supp_reads	<NUM>	The minimum number of tumor supporting reads required for 
				calling a variant (default: 4)

   --cpus_per_node	<NUM>	Number of CPUs to be used on each node

   --patient_id		<TEXT>	Text appended to each of the output file names

* The reads must appear in the same order in both FASTQ files of a pair

Running SMuFin example from FASTQ files

Before using SMuFin in production, we advise the user to test its performance on a local platform using the example dataset provided.

The example dataset corresponds to the chr22 Normal-Tumor pair samples from the in-silico genome used here.

The dataset files are distributed as:

	File name	Content of the file (file names of the paired-end reads)

normal_fastqs_1.txt (normal files, 80 bp reads, 500 bp insert size)
	fasta_files/chr22_insilico_Normal_30x_3_1.fastq.gz *
	fasta_files/chr22_insilico_Normal_30x_4_1.fastq.gz
	fasta_files/chr22_insilico_Normal_30x_5_1.fastq.gz
	fasta_files/chr22_insilico_Normal_30x_6_1.fastq.gz
	fasta_files/chr22_insilico_Normal_30x_7_1.fastq.gz

normal_fastqs_2.txt (normal files, 80 bp reads, 500 bp insert size)
	fasta_files/chr22_insilico_Normal_30x_3_2.fastq.gz *
	fasta_files/chr22_insilico_Normal_30x_4_2.fastq.gz
	fasta_files/chr22_insilico_Normal_30x_5_2.fastq.gz
	fasta_files/chr22_insilico_Normal_30x_6_2.fastq.gz
	fasta_files/chr22_insilico_Normal_30x_7_2.fastq.gz

tumor_fastqs_1.txt (tumor files, 80 bp reads, 500 bp insert size)
	fasta_files/chr22_insilico_Tumor_30x_10_1.fastq.gz
	fasta_files/chr22_insilico_Tumor_30x_11_1.fastq.gz
	fasta_files/chr22_insilico_Tumor_30x_8_1.fastq.gz
	fasta_files/chr22_insilico_Tumor_30x_9_1.fastq.gz

tumor_fastqs_2.txt (tumor files, 80 bp reads, 500 bp insert size)
	fasta_files/chr22_insilico_Tumor_30x_10_2.fastq.gz
	fasta_files/chr22_insilico_Tumor_30x_11_2.fastq.gz
	fasta_files/chr22_insilico_Tumor_30x_8_2.fastq.gz
	fasta_files/chr22_insilico_Tumor_30x_9_2.fastq.gz

* The two files of a paired-end pair must be specified on the same line of their respective list files (for example, line 1 of normal_fastqs_1.txt pairs with line 1 of normal_fastqs_2.txt).
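
Because a mismatch between the two list files is easy to introduce, the pairing rule above can be checked before launching a run. The following Python sketch is not part of SMuFin. It assumes that each list file holds one FASTQ path per line and that the two files of a pair differ only in the `_1.fastq` / `_2.fastq` part of their names, as in the example dataset; both assumptions, and the function name, are illustrative.

```python
def check_pairs(files_1, files_2):
    """Compare two lists of FASTQ paths line by line.

    files_1 / files_2: paths read from the *_fastqs_1.txt and *_fastqs_2.txt
    list files (assumed: one path per line, same order in both).
    Returns a list of human-readable problems; empty if the lists pair up.
    """
    problems = []
    if len(files_1) != len(files_2):
        problems.append(
            f"different number of files: {len(files_1)} vs {len(files_2)}")
    for line_no, (first, second) in enumerate(zip(files_1, files_2), start=1):
        # Assumed naming convention, taken from the example dataset.
        expected = first.replace("_1.fastq", "_2.fastq")
        if expected != second:
            problems.append(f"line {line_no}: {first} does not pair with {second}")
    return problems


def read_list_file(path):
    """Read a list file, skipping empty lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]
```

A typical use would be `check_pairs(read_list_file("normal_fastqs_1.txt"), read_list_file("normal_fastqs_2.txt"))`, repeated for the tumor lists.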

The example dataset can be run using this command line:

mpirun --np 16 ./SMuFin --ref ref_genome/hg19.fa --normal_fastq_1 normal_fastqs_1.txt --normal_fastq_2 normal_fastqs_2.txt --tumor_fastq_1 tumor_fastqs_1.txt --tumor_fastq_2 tumor_fastqs_2.txt --patient_id chr22_insilico --cpus_per_node 16

NOTE: This command will run SMuFin with default parameters:

    --min_supp_reads 4: Minimum number of tumor supporting reads required for calling a variant.

    --tumor_cont_perc 0: Expected percentage of tumor contamination in normal sample.
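
For scripted runs, the command line can be assembled programmatically. The sketch below is a hypothetical Python helper, not part of SMuFin: it uses only the options listed in this manual, treats a sample given as a single path as a BAM file and a sample given as a pair of paths as FASTQ list files, and copies the `mpirun --np 16 ./SMuFin` prefix from the example above. Whether SMuFin accepts a BAM file for one sample and FASTQ files for the other is not stated here, so that combination is an untested assumption.

```python
def build_smufin_args(ref, normal, tumor, tumor_cont_perc=None,
                      min_supp_reads=None, cpus_per_node=None, patient_id=None):
    """Build the SMuFin option list.

    normal / tumor: a BAM path (str) or a (fastq_1, fastq_2) tuple of
    list files. Optional values left as None are omitted, so SMuFin
    falls back to its own defaults.
    """
    args = ["--ref", ref]
    for label, sample in (("normal", normal), ("tumor", tumor)):
        if isinstance(sample, str):
            args += [f"--{label}_bam", sample]
        else:
            fastq_1, fastq_2 = sample
            args += [f"--{label}_fastq_1", fastq_1,
                     f"--{label}_fastq_2", fastq_2]
    if tumor_cont_perc is not None:
        # The manual gives the valid range as 0-100.
        if not 0 <= tumor_cont_perc <= 100:
            raise ValueError("tumor_cont_perc must be between 0 and 100")
        args += ["--tumor_cont_perc", str(tumor_cont_perc)]
    if min_supp_reads is not None:
        args += ["--min_supp_reads", str(min_supp_reads)]
    if cpus_per_node is not None:
        args += ["--cpus_per_node", str(cpus_per_node)]
    if patient_id is not None:
        args += ["--patient_id", patient_id]
    return args


# Reproduces the options of the example command above.
command = ["mpirun", "--np", "16", "./SMuFin"] + build_smufin_args(
    "ref_genome/hg19.fa",
    ("normal_fastqs_1.txt", "normal_fastqs_2.txt"),
    ("tumor_fastqs_1.txt", "tumor_fastqs_2.txt"),
    cpus_per_node=16,
    patient_id="chr22_insilico",
)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the directory that contains the SMuFin binary.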

Output files

SMuFin provides three output files:

    somatic_SNV.txt: for Single Nucleotide Variations.

    somatic_small_SVs.txt: for small SVs.

    somatic_large_SVs.txt: for breakpoints of large SVs.

These three files correspond to the three categories of somatic events: SNVs, small SVs (deletions, insertions and inversions) and large SVs (breakpoints). "Small" and "large" refer to variants smaller or larger than the read size, respectively.

The contents of the output files are:

somatic_SNV.txt
Mut_ID	SNV ID
Type	Mutation type. Always SNV in this file
Chr	Reference chromosome id
Pos	1-based reference coordinate of the SNV
Normal_NT	Nucleotide found in normal genome
Tumor_NT	Mutated nucleotide found in tumor genome
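
The SNV output can be loaded for downstream analysis with a few lines of Python. This is a minimal sketch, not part of SMuFin: it assumes that somatic_SNV.txt is tab-separated with the six fields in the order listed above, and it skips a header line starting with "Mut_ID" if one is present. Neither detail is confirmed by this manual, so check them against a real output file.

```python
import csv
from typing import Iterable, Iterator, NamedTuple


class SNV(NamedTuple):
    mut_id: str
    mut_type: str
    chrom: str
    pos: int        # 1-based reference coordinate
    normal_nt: str
    tumor_nt: str


def parse_snvs(lines: Iterable[str]) -> Iterator[SNV]:
    """Yield one SNV per data line (assumed tab-separated, six fields)."""
    for row in csv.reader(lines, delimiter="\t"):
        # Skip blank lines and an optional header line.
        if not row or row[0] == "Mut_ID":
            continue
        mut_id, mut_type, chrom, pos, normal_nt, tumor_nt = row[:6]
        yield SNV(mut_id, mut_type, chrom, int(pos), normal_nt, tumor_nt)
```

For example, `with open("somatic_SNV.txt") as handle: snvs = list(parse_snvs(handle))` collects all records; the file name used in practice may include the text given with --patient_id.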

somatic_small_SVs.txt
Mut_ID	Small SV ID
Type	Mutation type: DEL, INS or INV
Chr	Reference chromosome id
Pos	1-based reference coordinate of the position just before the small event
Size	The length of the small SV
Sequence	Inserted sequence (for INS events)

somatic_large_SVs.txt
Mut_ID	Large BKP ID
Type	Mutation type: BKP
Chr_BKP_1	Reference chromosome id of the sequence before the breakpoint
Pos_BKP_1	1-based reference coordinate of the position just before the breakpoint
Chr_BKP_2	Reference chromosome id of the sequence after the breakpoint
Pos_BKP_2	1-based reference coordinate of the position just after the breakpoint
local(left_strand left_ini..left_end right_strand right_ini..right_end)	Local mapping of "Ext_Sequence", made up of the following fields:
	left_strand	Mapping strand of the left part of "Ext_Sequence"
	local_left_ini	Initial local mapping offset of the left part of "Ext_Sequence"
	local_left_end	Final local mapping offset of the left part of "Ext_Sequence"
	right_strand	Mapping strand of the right part of "Ext_Sequence"
	local_right_ini	Initial local mapping offset of the right part of "Ext_Sequence"
	local_right_end	Final local mapping offset of the right part of "Ext_Sequence"
Ext_Sequence	The genomic sequence extension around the breakpoint (< 200 bp)