Introduction

SOAPfusion is a novel tool for fusion discovery with paired-end RNA-Seq reads. The tool follows a different strategy by “finding fusions directly and verifying them”, differentiating it from all other existing tools by “finding the candidate regions and searching for the fusions afterwards”. This enables the fusion discovery process to be more effective and sensitive, also with a specular performance under low coverage of sequencing far more better than other tools.

In the package, it encoporates several modules developped specifically for mapping RNA-seq reads, finding gene fusions and false positive elimination. One thing to note is that the SOAPfusion-aligner is designed for mapping RNA-seq reads both in an intact manner and segemental (with long distantance in-between, or on different chromosomes) manner, which can be applied in other bioinformatics senario requiring segmental mappings as well.

SOAPfusion is accurate and efficient for fusion discovery under various sequencing coverage (10X~50X, see Section “Performance”) with high sensitivity (≥97%), low false positive rate (≤1.36%) and has a saturation level of 10X, highlighting its ability to detect fusions efficiently at low sequencing cost.

The alogrithms and software in this package were developed by the Algorithm and Bioinformatics Group at The University of Hong Kong (Jikun Wu , Dr. S.M. Yiu). In collaboration with BGI (Zhiyu Peng, Wenqian Zhang).

System Requirements

  1. Hardware:
    1. 64-bit x86-64 Intel CPU with SSE instructions
    2. The program needs ~600 MB main memory to run with data from Homo sapiens
  2. Software:
    1. 64-bit Linux System
    2. The version of gcc compiler is at least 4.2.4
    3. The version of perl is at least 5.8.5

Download

  1. Download SOAPfusion executable files from the link below.

  2. New Release: for your convenience, please download the all- in-one package to run SOAPfusion.

    Release 1.1, 10-08-2012

    download(MD5: 34d88900b0a56863adaabfcbda3355e6)

    Release 1.0, 14-06-2012

    download(MD5: 46c5836e3991c2a7051d2d1bd0ab0ccf)

    Release 0.2, 11-15-2011

    download(MD5: 0af0abec456518d839704d7d000a0eac)

    Release 0.1, 01-07-2011

  3. Download Pre-bulilt indexes.

  4. H. sapines, UCSC hg18 (size: 226 MB)

    download(MD5: 6c92708b0488cab9f75eac7794ed866c)

Usage

  • Download SOAPfusion-vX.X.tar.gz with pre-built hg18 reference files and extract it to your local directory.
  • $ tar -zxf SOAPfusion-vX.X.tar.gz
    $ cd SOAPfusion-vX.X
    
    After, you can find the structure of the program directory as:
    SOAPfusion
            |-- soapfusion.sh                
            |-- soapfusion.cfg        
            |-- bin
            |   |-- msort                   
            |   |-- indexbuilder.sh                    
            |   |-- maskFastaFromBed         
            |   |-- 2bwt-builder            
            |   |-- discovery.pl            
            |   |-- extra_tools.pl          
            |   `-- soapfusion_aligner       
            |-- README.pdf                       
            |-- example
            |   |-- nc.bed                   
            |   `-- example.cfg              
            `-- reference
    
  • Reference Preparation
  • Besides downloading reference files, you can also prepare your own set of reference files with following steps.
    1. First download the transcript referenc files from UCSC database: http://genome.ucsc.edu/cgi-bin/hgTables?command=start and configure the ./bin/indexbuilder.cfg.
    2. ################## Configuration ######## 
      
      ################## UCSC Gene Annntation File ##################### 
      program_path="/usr/local/bin/soapfusion" 
      
      species="HG19" 
      
      transcript_psl="/usr/local/bin/soapfusion/reference/ref_hg18.psl" 
      
      NMToTreefam="/usr/local/bin/soapfusion/NMToTreefam.txt"
      
      kgXref="/usr/local/bin/soapfusion/kgXref" 
      
      ------------------------------------------------------------------
      
      * Sample of formats for these reference files can be found at example directory.
    3. Then get a list of known exons in BED formats.
    4. Run the script indexbuilder.sh in bin directory.
      ./bin/indexbuilder.sh 
      * Sample of formats for these nc.bed can be found at example directory.
    5. Masking reference genome with maskFastaFromBed from samtools, in bin directory. Put the reference genome genome.fa into ./reference/species first.
    6. ./bin/maskFastaFromBed -fi ./reference/species/genome.fa -bed 
      ./reference/species/nc.bed -fo ./reference/species/genome.mk.fa 
      
      *replace species with species name you have configured in above indexbuilder.cfg.
    7. Indexing for Reference Genome with 2bwt-builder.
    8. ./bin/2bwt-builder ./reference/species/genome.mk.fa
  • How to Run
  •     After getting the reference files, you can run SOAPfusion with following steps.
    1. Config the configuration file soapfusion.conf:
    2. ################## Configuration ######## 
      
      ################## Executable Path ##################### 
      program_path="/usr/local/bin/soapfusion" 
      
      work_path="./" 
      
      ################## Sample Name and Read Path ##################### 
      species="HG19"
      sample_name="c1X" 
      
      read1="./c1Xr1.fq" 
      read2="./c1Xr2.fq" 
      
      ################## SOAPfusion-aligner Configuration  ##################### 
      mismatches=3            # number of mismatches allowed in SOAPfusion-aligner
      indels=2                     # number of indels allowed in SOAPfusion-aligner
      
      insert_size=200         # insert size of RNA-seq reads
      read_length=75         # length of input reads
      
      thread_num=5          # number of threads to use
      
    3. Run the whole pipeline with sh soapfusion.sh.
    4. * In case you cannot get it run, use /bin/bash soapfusion.sh instead.
  • Result file format

  •     The final list of reported fusions were listed in *.fusionList.

        The description of each column:
        1. 5' parental gene.
        2. 3' parental gene.
        3. transcript ID for 5' parental gene.
        4. the exon of 5' parental gene where fusion site locates.
        5. transcript ID for 3' parental gene.
        6. the exon of 3' parental gene where fusion site locates.
        7. chromosome ID of 5' parental gene.
        8. chromosome ID of 3' parental gene.
        9. 5' fusion site on chromosome.
        10. 3' fusion site on chromosome.
        11. strand of 5' parental gene, 'fwd' stands for '+', 'rev' stands for '-'.
        12. strand of 3' parental gene, 'fwd' stands for '+', 'rev' stands for '-'.
        13. fusion reads, e.g. 1-210 stands for the 210th read in fq file 1 is segmentally mapped to the fusion site while its     mated read is mapped to one of the parental genes.
        14. total number of fusion spanning reads which are intactlly mapped to two parental genes.
        15. fusion spanning reads, e.g. '210' stands for the 210th PE reads are intactlly mapped to two parental genes.
To Top

Performamce & Reference

We evaluated SOAPfusion among current available fusion discovery tools, by compare it with previously published ones.
We obtained a total of 31,399 RefSeq transcripts of human reference hg18. By randomly picking up two transcripts and two exon boundaries of them, we joined the two transcripts to form fusion transcripts at the picked exon boundaries. In this way 300 fusions were simulated. Then we ran MAQ on the new transcriptome (including all RefSeq transcripts and the newly simulated fusion transcripts) to simulate PE reads. In total, we generated seven sets of simulated reads under sequencing depth 1X, 5X, 10X, 20X, 30X, 40X and 50X, file sizes of which are 242 Mb, 1.19 Gb, 2.40 Gb, 4.80 Gb, 7.20 Gb, 9.60 Gb and 12 Gb respectively. The comparison result were summarized as:

The upper diagram gives the sensitivity of all tools under coverages 1~50X. The lower one shows the FDR (False Discovery Rate) of all tools under coverage 1~50X.

Of all existing tools, SOAPfusion has the highest sensitivity and lowest FDR under all coverages. And even in low coverage, it still had a good performane.

* FusionSeq (v 1.42.1), deFuse (v 0.4.1), FusionHunter (v 1.2), ShortFuse (v 0.2), FusionMap (v 0.6.1) and TopHat-Fusion (v 0.1.0) were used to conduct comparisons with SOAPfusion. To keep consistency, we used reference genome hg18 for all tools. Additionally, to benchmark the performance of different tools, in the first place we set the same threshold for the required number of fusion supporting reads in preliminary experiments. But it was found this caused some tools far worse than expected according to the statements in their publications. Therefore, we chose to use default parameters or specified parameters suggested in those publications. All tools are tested under CentOS v5.5 on a computer of x86_64 architecture with 30GB memory.

Reference

  • T. W. Lam, W. K. Sung, S. L. Tam, C. K. Wong and S. M. Yiu. Compressed indexing and local alignment of DNA. Bioinformatics. Vol. 24 no. 6 2008, pages 791–797.

*For more details, you can read the 'README.pdf' in the software package.
To Top