SMuFin - Manual

 
Introduction
SMuFin is a method designed at the Barcelona Supercomputing Center for detecting somatic variation in tumor genomes by directly comparing the reads of a tumor genome against those of a normal genome from the same individual. SMuFin does not require a prior alignment of the tumor and normal reads to a reference genome. Instead, it uses the reads from FASTQ or BAM files directly to identify somatic variants, which are then mapped onto the provided reference.

Please note that, because SMuFin identifies all detectable differences between two sequence samples, it can potentially be used for a broad range of purposes. However, the current version (as described here) has been calibrated and tuned to identify somatic variants in the context of cancer genomics, using tumor and normal genomes from the same individual. We therefore cannot guarantee the reported sensitivity and specificity ranges if it is used outside this context.
 
How to Install
In order to build SMuFin from the source code, the user must have the following tools and libraries installed:

C/C++ Compiler (GCC >= 4.3.X)

MPI Environment (OpenMPI >= 1.5.X)


To build SMuFin, run the following commands:
   $> tar -zxvf smufin_0.9.3_mpi_beta.tar.gz
   $> cd smufin_0.9.3_mpi_beta
   $> make
 
System Requirements
SMuFin MPI distributes the genomic load among the nodes of a cluster. Therefore, the minimum number of computing nodes must first be determined. For example, a human genome dataset of 30x sequencing coverage typically requires around 250GB RAM, distributed among all defined MPI instances. An approximate estimate of the number of nodes required for each sample type can be obtained using the following formula:
MT = Mt + Mn
M(t or n) = L * C * 2.3
N = MT / R
where:
	Mt = Memory necessary for the tumor genome
	Mn = Memory necessary for the normal genome
	MT = The total approximate amount of memory required
	     for a complete analysis of a tumor-normal pair
	L = Genome length in Megabase pairs covered by the sequencing
	C = Sequencing coverage
	M(T, t or n) = Required RAM in MBytes
	R = Memory available per user in each node (in MBytes)
	N = Minimum number of computing nodes required
For example, a complete analysis of a tumor-normal pair with an approximate L of 3000, each sequenced at C = 30x, would require, on a cluster with 32GB RAM per node, a minimum of N = (3000 * 30 * 2.3) * 2 / 32000 ~ 13 computing nodes.

Beyond these calculations, we recommend using more nodes than the strict minimum estimate.
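
The estimate above can be scripted. The following is a minimal sketch in Python, not part of SMuFin; the function names are hypothetical, and the 2.3 factor and the MByte units are taken directly from the formula in this section. The result is rounded up because a fractional node cannot be allocated.

```python
import math

# Memory factor from the formula M = L * C * 2.3 (MBytes per Mbp per 1x coverage).
MEMORY_FACTOR = 2.3


def sample_memory_mb(genome_mbp, coverage):
    """Approximate memory (MBytes) needed for one sample: M = L * C * 2.3."""
    return genome_mbp * coverage * MEMORY_FACTOR


def min_nodes(genome_mbp, tumor_coverage, normal_coverage, ram_per_node_mb):
    """Minimum number of nodes: N = (Mt + Mn) / R, rounded up."""
    total_mb = (sample_memory_mb(genome_mbp, tumor_coverage)
                + sample_memory_mb(genome_mbp, normal_coverage))
    return math.ceil(total_mb / ram_per_node_mb)


# Worked example from this section: L = 3000, C = 30x for both samples,
# 32000 MBytes available per node.
print(min_nodes(3000, 30, 30, 32000))  # prints 13
```

Because the result is a strict minimum, a real allocation would normally add a safety margin on top of it.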
 
How to Run SMuFin
SMuFin can be run using FASTQ or BAM files.
 
To run SMuFin, the paths to the tumor and normal samples and to the reference genome must be specified. The reference genome must be in FASTA format, and the normal and tumor reads must be in FASTQ format. Alternatively, BAM files can be used as the source of the reads. Because raw sequences are extracted from the BAM files, optimal results are obtained when the BAM files contain both aligned and unaligned sequence reads.
 
Usage: SMuFin < command > [ options ]
Commands:

   --ref		<FILE>	Reference genome in FASTA format 
   --normal_bam		<FILE>	Use Normal BAM file instead of FASTQ
   --tumor_bam		<FILE>	Use Tumor BAM file instead of FASTQ
   --normal_fastq_1*	<FILE>	Normal FASTQ 1st Paired-End file
   --normal_fastq_2*	<FILE>	Normal FASTQ 2nd Paired-End file
   --tumor_fastq_1*	<FILE>	Tumor FASTQ 1st Paired-End file
   --tumor_fastq_2*	<FILE>	Tumor FASTQ 2nd Paired-End file
   --tumor_cont_perc	<NUM>	Expected percentage of tumor contamination in normal 
				dataset 0-100
   --min_supp_reads	<NUM>	The minimum number of tumor supporting reads required for 
				calling a variant <default 4>
   --cpus_per_node	<NUM>	Number of cpus to be used for each node 
   --patient_id		<TEXT>	Text appended to each of the output filenames
* The order of the reads in the two FASTQ files of a pair must be the same
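
As an illustration of how the options above combine, the following Python sketch assembles a command line for either input mode. It is a hypothetical helper, not part of SMuFin: the function name and its parameters are assumptions, the flags are the ones listed above, and the "mpirun --np" launcher and "./SMuFin" path follow the example given later in this manual. Treating BAM and FASTQ input as alternatives for each sample is our reading of "instead of FASTQ".

```python
def build_smufin_command(ref, normal, tumor, mpi_procs, patient_id=None,
                         cpus_per_node=None, min_supp_reads=None,
                         tumor_cont_perc=None):
    """Return the command as a list of arguments.

    normal / tumor: either a BAM path (str) or a (fastq_1, fastq_2) tuple.
    """
    cmd = ["mpirun", "--np", str(mpi_procs), "./SMuFin", "--ref", ref]
    for label, sample in (("normal", normal), ("tumor", tumor)):
        if isinstance(sample, str):
            # A single path is interpreted as a BAM file.
            cmd += [f"--{label}_bam", sample]
        else:
            # A pair is interpreted as 1st and 2nd paired-end FASTQ inputs.
            fastq_1, fastq_2 = sample
            cmd += [f"--{label}_fastq_1", fastq_1,
                    f"--{label}_fastq_2", fastq_2]
    if tumor_cont_perc is not None:
        if not 0 <= tumor_cont_perc <= 100:
            raise ValueError("tumor_cont_perc must be between 0 and 100")
        cmd += ["--tumor_cont_perc", str(tumor_cont_perc)]
    if min_supp_reads is not None:
        cmd += ["--min_supp_reads", str(min_supp_reads)]
    if patient_id is not None:
        cmd += ["--patient_id", patient_id]
    if cpus_per_node is not None:
        cmd += ["--cpus_per_node", str(cpus_per_node)]
    return cmd
```

The returned list can be passed to subprocess.run, or joined with spaces to inspect the command before launching it. Options left as None are omitted, so SMuFin applies its own defaults.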

 
Running SMuFin example from FASTQ files
Before using SMuFin in production, we advise users to test its performance on a local platform using the example dataset provided.

The example dataset corresponds to the chr22 Normal-Tumor pair samples from the in-silico genome used here.
The dataset files are distributed as follows:

File name: normal_fastqs_1.txt (normal files, 80 bp reads, 500 insert size)
Content (file names of paired-end reads):
	fasta_files/chr22_insilico_Normal_30x_3_1.fastq.gz *
	fasta_files/chr22_insilico_Normal_30x_4_1.fastq.gz
	fasta_files/chr22_insilico_Normal_30x_5_1.fastq.gz
	fasta_files/chr22_insilico_Normal_30x_6_1.fastq.gz
	fasta_files/chr22_insilico_Normal_30x_7_1.fastq.gz

File name: normal_fastqs_2.txt (normal files, 80 bp reads, 500 insert size)
Content (file names of paired-end reads):
	fasta_files/chr22_insilico_Normal_30x_3_2.fastq.gz *
	fasta_files/chr22_insilico_Normal_30x_4_2.fastq.gz
	fasta_files/chr22_insilico_Normal_30x_5_2.fastq.gz
	fasta_files/chr22_insilico_Normal_30x_6_2.fastq.gz
	fasta_files/chr22_insilico_Normal_30x_7_2.fastq.gz

File name: tumor_fastqs_1.txt (tumor files, 80 bp reads, 500 insert size)
Content (file names of paired-end reads):
	fasta_files/chr22_insilico_Tumor_30x_10_1.fastq.gz
	fasta_files/chr22_insilico_Tumor_30x_11_1.fastq.gz
	fasta_files/chr22_insilico_Tumor_30x_8_1.fastq.gz
	fasta_files/chr22_insilico_Tumor_30x_9_1.fastq.gz

File name: tumor_fastqs_2.txt (tumor files, 80 bp reads, 500 insert size)
Content (file names of paired-end reads):
	fasta_files/chr22_insilico_Tumor_30x_10_2.fastq.gz
	fasta_files/chr22_insilico_Tumor_30x_11_2.fastq.gz
	fasta_files/chr22_insilico_Tumor_30x_8_2.fastq.gz
	fasta_files/chr22_insilico_Tumor_30x_9_2.fastq.gz
 
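* The two files of a paired-end set must be listed on the same line of their respective list files (for example, line 1 of normal_fastqs_1.txt pairs with line 1 of normal_fastqs_2.txt).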
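
The line-by-line pairing rule can be checked before launching a run. The sketch below is a hypothetical helper, not part of SMuFin. It assumes the naming convention of the example dataset, where mate files differ only in a trailing "_1" or "_2" before the ".fastq" or ".fastq.gz" extension; list files that follow another convention would need a different rule.

```python
import re

# Matches the "_1" suffix that precedes the extension in the example dataset.
MATE_1_SUFFIX = re.compile(r"_1(\.fastq(?:\.gz)?)$")


def check_pairing(list_1_lines, list_2_lines):
    """Return (line_number, file_1, file_2) for every line that does not pair up."""
    names_1 = [line.strip() for line in list_1_lines if line.strip()]
    names_2 = [line.strip() for line in list_2_lines if line.strip()]
    if len(names_1) != len(names_2):
        raise ValueError("list files contain a different number of entries")
    problems = []
    for number, (file_1, file_2) in enumerate(zip(names_1, names_2), start=1):
        # Derive the expected mate name by swapping the "_1" suffix for "_2".
        expected = MATE_1_SUFFIX.sub(r"_2\1", file_1)
        if expected == file_1 or expected != file_2:
            problems.append((number, file_1, file_2))
    return problems
```

For the example dataset, the function could be called with the open list files, for instance check_pairing(open("normal_fastqs_1.txt"), open("normal_fastqs_2.txt")); an empty result means every line pairs up under the assumed convention.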
 
The example dataset can be run with the following command line:
mpirun --np 16 ./SMuFin --ref ref_genome/hg19.fa --normal_fastq_1 normal_fastqs_1.txt --normal_fastq_2 normal_fastqs_2.txt --tumor_fastq_1 tumor_fastqs_1.txt --tumor_fastq_2 tumor_fastqs_2.txt --patient_id chr22_insilico --cpus_per_node 16


NOTE: This command will run SMuFin with default parameters:
    --min_supp_reads 4: Minimum number of tumor supporting reads required for calling a variant.
    --tumor_cont_perc 0: Expected percentage of tumor contamination in normal sample.
 
Output files
SMuFin provides three output files:
    somatic_SNV.txt: for Single Nucleotide Variations.
    somatic_small_SVs.txt: for small SVs.
    somatic_large_SVs.txt: for breakpoints of large SVs.
 
These three outputs correspond to the three categories of somatic events: SNVs, small SVs (deletions, insertions, inversions) and large SVs (breakpoints). "Small" and "large" refer to variants shorter or longer than the read length, respectively.

The contents of the output files are:

somatic_SNV.txt
	Mut_ID	SNV ID
	Type	Mutation type (always SNV in this file)
	Chr	Reference chromosome id
	Pos	1-based reference coordinate of the SNV
	Normal_NT	Nucleotide found in the normal genome
	Tumor_NT	Mutated nucleotide found in the tumor genome
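
As an illustration of how these fields might be consumed downstream, the following Python sketch reads SNV records into named tuples. It is not part of SMuFin and rests on assumptions that this manual does not confirm: that the columns appear in the order listed above, that they are separated by whitespace, and that an optional header line starts with "Mut_ID". The names SNV and parse_snv_lines are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class SNV(NamedTuple):
    mut_id: str
    chrom: str
    pos: int        # 1-based reference coordinate
    normal_nt: str
    tumor_nt: str


def parse_snv_lines(lines: Iterable[str]) -> List[SNV]:
    """Parse SNV records, assuming whitespace-separated columns in listed order."""
    records = []
    for line in lines:
        fields = line.split()
        # Skip blank lines and an (assumed) header line.
        if not fields or fields[0] == "Mut_ID":
            continue
        mut_id, mut_type, chrom, pos, normal_nt, tumor_nt = fields[:6]
        if mut_type != "SNV":
            raise ValueError(f"unexpected mutation type: {mut_type}")
        records.append(SNV(mut_id, chrom, int(pos), normal_nt, tumor_nt))
    return records
```

A file would be read with, for example, parse_snv_lines(open("somatic_SNV.txt")). If the real files use a different delimiter or column order, the split and unpacking steps need to be adapted.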
 
somatic_small_SVs.txt
	Mut_ID	Small SV ID
	Type	Mutation type: DEL, INS or INV
	Chr	Reference chromosome id
	Pos	1-based reference coordinate of the position immediately preceding the small event
	Size	Length of the small SV
	Sequence	Inserted sequence (for INS)
 
somatic_large_SVs.txt
	Mut_ID	Large BKP ID
	Type	Mutation type: BKP
	Chr_BKP_1	Reference chromosome id of the sequence before the breakpoint
	Pos_BKP_1	1-based reference coordinate of the position just before the breakpoint
	Chr_BKP_2	Reference chromosome id of the sequence after the breakpoint
	Pos_BKP_2	1-based reference coordinate of the position just after the breakpoint
	local(left_strand left_ini..left_end right_strand right_ini..right_end)	Local mapping of "Ext_Sequence", made up of the following components:
		left_strand	Mapping strand of the left part of "Ext_Sequence"
		local_left_ini	Initial local mapping offset of the left part of "Ext_Sequence"
		local_left_end	Final local mapping offset of the left part of "Ext_Sequence"
		right_strand	Mapping strand of the right part of "Ext_Sequence"
		local_right_ini	Initial local mapping offset of the right part of "Ext_Sequence"
		local_right_end	Final local mapping offset of the right part of "Ext_Sequence"
	Ext_Sequence	Genomic sequence extended around the breakpoint (< 200 bp)