Introduction

SOAPfuse is an open-source tool for genome-wide detection of fusion transcripts from paired-end RNA-Seq data. In comparisons with previously released tools, SOAPfuse performed well (see 'Performance Evaluation'). It is written in Perl and currently supports only human RNA-Seq data.

Alert:
The SOAPfuse project has been set up on SourceForge, and updates and news will be released there first. Please follow its wiki and blog.

System Requirements

  1. Hardware:
    1. 64-bit x86-64 Intel CPU with SSE instructions
    2. About 8 GB of memory to run with RNA-Seq data from Homo sapiens
  2. Software:
    1. 64-bit Linux system
    2. Perl 5.8.5 or later

Download

You can also check the 'Download' page on SourceForge.

  1. Download SOAPfuse from the links below.

    Release 1.26, 07-30-2013 (new)

    download (MD5: e858b26d5fc5bb035bdf90e2299b25a1)

    Alert: The requirements for database files changed in v1.26. Please use the script shipped with SOAPfuse v1.26 to rebuild your database in one step. Do not use database files from v1.25 or earlier with v1.26; otherwise, errors may occur. To learn why the database was changed in v1.25, and again in v1.26, please click here.

    Click here to check the update log.

    Release 1.22, 03-18-2012 (36M)

    download (MD5: 934633f93f394c6ad416375c165c8864)

    Release 1.25, 04-12-2013
    Release 1.24, 03-19-2013
    Release 1.21, 01-31-2012
    Release 1.20, 01-31-2012
    Release 1.1, 11-14-2011
    Release 1.0, 09-16-2011

  2. Database used by SOAPfuse.

    SOAPfuse supplies a script that lets users construct the whole SOAPfuse database in one step.

    Click here to learn how to use this script to construct the SOAPfuse database.

    Here, we provide only the database used for the comparisons in our published paper.

    Note: This database package is only for SOAPfuse v1.22 to v1.24, not for v1.25+.

    hg19 (7.9 GB, Homo sapiens, Ensembl release 59)

    download (MD5: 80a119ec441689575235c239aea88d95)
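
Before unpacking a download, it can help to compare its MD5 checksum with the value listed next to the link. The following is a minimal Python sketch of that check and is not part of SOAPfuse; the helper names `md5_of_file` and `matches_md5` are made up for illustration, and the tarball name in the usage comment is the placeholder used in the Installation section.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so multi-gigabyte database packages
        # do not have to fit into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_hex):
    """Compare a file's MD5 with the value published next to its download link."""
    return md5_of_file(path) == expected_hex.strip().lower()


# Example call (hypothetical local path; MD5 taken from the Release 1.26 entry):
# matches_md5("SOAPfuse-vX.X.tar.gz", "e858b26d5fc5bb035bdf90e2299b25a1")
```

If `matches_md5` returns False, the download is probably incomplete or corrupted and should be fetched again.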

Installation

  1. Download the SOAPfuse package from the link above.
  2. In the Linux terminal:

    $ tar -xzf /PATH_WHERE_YOU_PUT_THE_TARBALL/SOAPfuse-vX.X.tar.gz
    $ cd SOAPfuse-vX.X/

Preparation

You can also check the 'Preparation' page on SourceForge.

  1. Prepare sample list

    Prepare a list file for your samples in the format below (four columns).

          1                   2                 3             4
    [sample_ID] [sequence_library_ID] [run_ID] [read_length]

    For example, sample A has RNA-Seq data from three runs: Run-a and Run-b come from the same library (Lib-a, insert size 300 nt), and Run-c comes from another library (Lib-b, insert size 170 nt). Run-a and Run-c are both paired-end 90 nt (PE90), while Run-b is PE100. The list file can be written like this:

    A   Lib-a   Run-a   90
    A   Lib-a   Run-b   100
    A   Lib-b   Run-c   90

    Note:

    1. Each line describes one run.
      If you have N runs for one sample, write N lines: one run, one line.
    2. Prepare one list per sample if you want to analyze samples in parallel.
      SOAPfuse takes one sample list per invocation, so for N samples we suggest N list files and N SOAPfuse runs, which can then proceed in parallel.
    3. Insert size is not required.
      Insert sizes provided by users are often inaccurate, so the sample list does not include them.
      Instead, SOAPfuse estimates the actual insert size within the pipeline.
    4. Different read lengths are allowed.
      1. If the RNA-Seq data of one sample come from several runs with different read lengths, that is not a problem: SOAPfuse has algorithms to distinguish them so that calculations remain accurate.
      2. If the read lengths of the /1 end and the /2 end differ within one run (uncommon), give both lengths separated by '/'. For example, sample A has another run (Run-d from Lib-a) in which the /1 end is 80 nt and the /2 end is 90 nt. The sample list can be written like this:
        A   Lib-a   Run-a   90
        A   Lib-a   Run-b   100
        A   Lib-b   Run-c   90
        A   Lib-a   Run-d   80/90
        You can also give both lengths for every run, if you prefer:
        A   Lib-a   Run-a   90/90
        A   Lib-a   Run-b   100/100
        A   Lib-b   Run-c   90/90
        A   Lib-a   Run-d   80/90
    5. Sample list for somatic mode.
      Suppose you are studying kidney cancer, and a tumor sample (K101-T, with PE90 RNA-Seq data Run-n from library Lib-n) and a control sample (K101-N, with PE100 RNA-Seq data Run-m from library Lib-m) come from the same patient (patient ID K101). To run SOAPfuse in somatic mode and detect somatic fusion transcripts, write K101-T and K101-N in the same sample list, like this:
      K101-T   Lib-n   Run-n   90
      K101-N   Lib-m   Run-m   100
      SOAPfuse distinguishes the tumor sample from the control sample by the postfix of the sample ID. State these postfixes via the parameter 'PA_all_postfix_of_tissue' in the config file, e.g. '-T' for the tumor sample and '-N' for the control sample.
      The config parameter 'PA_all_somatic_mode' must also be set to 'yes' to enable somatic mode.

  2. Prepare RNA-Seq data

    The RNA-Seq data files (fastq or fasta) are required and must be stored in a directory structure that follows the sample list file described above.

    Follow these five requirements to construct the directories that store the RNA-Seq data:

    1. A master directory stores all RNA-Seq data files in its sub-directories. We call it 'WHOLE_SEQ-DATA_DIR'.
    2. Use sample_ID to name the sub-directories of WHOLE_SEQ-DATA_DIR. We call them 'SAMPLE_DIR'.
    3. Use sequence_library_ID to name the sub-directories of SAMPLE_DIR. We call them 'LIB_DIR'.
    4. RNA-Seq data (fastq/fasta) files are stored in LIB_DIR with their run_ID as the file prefix.
      Because SOAPfuse works on paired-end reads, the prefix must also carry the serial number of the read end, as in 'Run_ID_1' and 'Run_ID_2'.
    5. There are no requirements for the PostFix of RNA-Seq data files.
      We generally use 'fq.gz' (fastq) or 'fa.gz' (fasta), as we always store read files in compressed (gz) format. In any case, the PostFix must be stated via the parameter 'PA_all_fq_postfix' in the config file.

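    For example, the RNA-Seq data files (fastq) of sample A from the sample list example above would be stored like this (assuming the PostFix 'fq.gz', joined to the file prefix with a dot):

    WHOLE_SEQ-DATA_DIR/
      A/
        Lib-a/
          Run-a_1.fq.gz   Run-a_2.fq.gz
          Run-b_1.fq.gz   Run-b_2.fq.gz
        Lib-b/
          Run-c_1.fq.gz   Run-c_2.fq.gz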
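
The sample list format and the five directory rules can be checked together before a run. The sketch below is an illustration in Python and is not part of SOAPfuse: the function names are hypothetical, columns are assumed to be separated by whitespace, and each file name is assumed to be the 'Run_ID_1' or 'Run_ID_2' prefix joined to the PostFix with a dot (for example 'Run-a_1.fq.gz').

```python
import os


def parse_read_length(text):
    """Turn '90' or '80/90' into a (read1_length, read2_length) tuple."""
    parts = text.split("/")
    if len(parts) == 1:
        parts = parts * 2  # a single value applies to both ends
    if len(parts) != 2 or not all(p.isdigit() for p in parts):
        raise ValueError("bad read length: %r" % text)
    return int(parts[0]), int(parts[1])


def parse_sample_list(lines):
    """Parse lines of: sample_ID  sequence_library_ID  run_ID  read_length."""
    runs = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()  # assumption: whitespace-separated columns
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError(
                "line %d: expected 4 columns, got %d" % (number, len(fields))
            )
        sample, library, run, length = fields
        runs.append((sample, library, run, parse_read_length(length)))
    return runs


def expected_read_files(runs, seq_data_dir, postfix):
    """List WHOLE_SEQ-DATA_DIR/sample_ID/library_ID/run_ID_{1,2}.PostFix per run."""
    paths = []
    for sample, library, run, _lengths in runs:
        for end in (1, 2):
            # Assumption: prefix and PostFix are joined with a dot.
            name = "%s_%d.%s" % (run, end, postfix)
            paths.append(os.path.join(seq_data_dir, sample, library, name))
    return paths
```

A typical use would be to read the list file, call `expected_read_files` with the master directory and the value of 'PA_all_fq_postfix', and report every path for which `os.path.exists` is false before starting the pipeline.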

Run SOAPfuse

You can also check the 'Run SOAPfuse' page on SourceForge.

To run SOAPfuse, first prepare the config file; SOAPfuse then runs according to this configuration.

  1. Check the config file:

    $ cd /PATH_WHERE_YOU_PUT_THE_PACKAGE/SOAPfuse-vX.X/config/
    $ less -S config.txt

    Note:

    1. All lines prefixed by '#' are comments.
    2. Parameter names and values are separated by '='; modify only the value after '='.
    3. Some values can be set to 'yes' or 'no', and some can be left at their defaults.
    4. Check the prefix of each parameter.
      There are five kinds of prefixes: 'DB', 'PG', 'PS', 'PD' and 'PA'.
      1. 'DB' means the info of DataBase.
      2. 'PG' means the info of ProGrams.
      3. 'PS' means the info of Pipeline Steps.
      4. 'PD' means the info of Pipeline Directories.
      5. 'PA' means the info of PArameters.
      # The 'DB', 'PG', 'PS' and 'PD' types locate the database and the installation, so SOAPfuse can run once these parameters are set correctly. The 'PA' type holds the parameters of each step, which already have default values, so you can ignore them on your first try. However, 'PA_all_fq_postfix', which defines the PostFix of the RNA-Seq data files, must be set according to your RNA-Seq files before running.

  2. Modify the config file:

    We assume that you have unpacked the SOAPfuse package and obtained the SOAPfuse-vX.X directory. We call the absolute path of this directory 'TOOL_DIR'.
    Download the database package ('hgXX-XX.for.SOAPfuse.tar.gz') from the links above and unpack it to get the hgXX-database directory. We call the absolute path of this directory 'DATABASE_DIR'.
    # You can also follow the guide to construct your own database files in DATABASE_DIR.

    Then, modify the config file as below:

    1. Define 'DB' prefix info
      DB_db_dir = /DATABASE_DIR/
    2. Define 'PG' prefix info
      PG_pg_dir = /TOOL_DIR/source/bin
    3. Define 'PS' prefix info
      PS_ps_dir = /TOOL_DIR/source
    4. Define 'PD' prefix info*
      PD_all_out = /out_directory/
    5. Define 'PA_all_fq_postfix'
      PA_all_fq_postfix = PostFix

    * PD_all_out is the directory you prepared to store all SOAPfuse results.
       You can also set it via the '-o' option of the main program, introduced below, which takes priority.
       SOAPfuse automatically creates the sub-directories of each step in out_directory when it runs.
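
Since the config file is a list of 'name = value' lines with '#' comments, a quick pre-flight check is easy to script. The following Python sketch is only an illustration and is not how SOAPfuse itself reads the file: the function names and the `PARAMETERS_TO_SET` tuple are hypothetical, and trailing comments on value lines are not handled because the text above does not say whether they are allowed.

```python
# Parameters that the steps above ask you to edit. PD_all_out may instead
# be supplied on the command line with '-o'.
PARAMETERS_TO_SET = (
    "DB_db_dir",
    "PG_pg_dir",
    "PS_ps_dir",
    "PD_all_out",
    "PA_all_fq_postfix",
)


def parse_config(lines):
    """Collect 'name = value' pairs, ignoring blank lines and '#' comments."""
    settings = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '=' themselves.
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip()
    return settings


def missing_settings(settings, names=PARAMETERS_TO_SET):
    """Return the listed parameter names that are absent or empty."""
    return [name for name in names if not settings.get(name)]


# Example call (hypothetical path):
# with open("config.txt") as handle:
#     print(missing_settings(parse_config(handle)))
```

An empty list from `missing_settings` means every parameter named in the steps above has a value; it does not verify that the directories exist.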

  3. Run SOAPfuse:

    You can find the main script 'SOAPfuse-RUN.pl' in TOOL_DIR. Use 'perl' to run it.

    Command:

    perl SOAPfuse-RUN.pl -c <config_file> -fd <WHOLE_SEQ-DATA_DIR> -l <sample_list> -o <out_directory> [Options]

    Options:

    -c  [s] Config file for running this pipeline. <required>
    -fd [s] Directory that stores the paired-end sequenced read files. <required>
            Sequenced read format can be fastq or fasta.
            Files can be gzip-compressed or plain text.
    -l  [s] The information list of the sample(s) you want to process. <required>
            This list can include information on one or more samples.
            It is suggested to include one sample/patient in each sample list file.
    -o  [s] Directory that will store all results.
            It takes priority over 'PD_all_out'; if it is omitted, set 'PD_all_out' in the config file.
    -fs [i] The step you want to start from. [1]
    -es [i] The step you want to end at. [9]
            Step 9 is the last step of the SOAPfuse pipeline.
    -tp [s] The name postfix of the temp directory*. [`date +%s`.'_'.int(rand(1000)+1)]
            Do not set the same string for different sample-info-list files.
            It is suggested to set this parameter to the sample ID, so that the
            scripts of different samples are easy to tell apart in the usual case
            where one sample-info-list file includes just one sample.
    -fm     Flag to enable perl fork management. [disabled]
    -h      Display this help info.
    * We suggest setting -tp to the sample ID or patient ID so that the temp directory is easy to identify, since we also suggest preparing one list per sample or per patient (in somatic mode).

    Other commands:

    1. To check the version of SOAPfuse:
      perl SOAPfuse-RUN.pl -c version
    2. To check the authors of SOAPfuse:
      perl SOAPfuse-RUN.pl -c who_is_author
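
When one list file is prepared per sample, the command line is usually assembled by a small wrapper. The sketch below shows one way to do that in Python using only the options listed above; the function `build_soapfuse_command` and its argument names are hypothetical, and the function only builds the argument list without running anything.

```python
def build_soapfuse_command(config, seq_data_dir, sample_list, out_dir=None,
                           temp_postfix=None, first_step=None, last_step=None,
                           fork=False, script="SOAPfuse-RUN.pl"):
    """Return the argument list for 'perl SOAPfuse-RUN.pl ...'."""
    # The three required options: -c, -fd and -l.
    command = ["perl", script, "-c", config, "-fd", seq_data_dir,
               "-l", sample_list]
    if out_dir is not None:
        command += ["-o", out_dir]        # takes priority over PD_all_out
    if temp_postfix is not None:
        command += ["-tp", temp_postfix]  # e.g. the sample ID or patient ID
    if first_step is not None:
        command += ["-fs", str(first_step)]
    if last_step is not None:
        command += ["-es", str(last_step)]
    if fork:
        command.append("-fm")
    return command


# Example call with placeholder paths:
# build_soapfuse_command("config.txt", "WHOLE_SEQ-DATA_DIR", "A.list",
#                        out_dir="out_directory", temp_postfix="A")
```

The returned list can be passed to `subprocess.run(command, check=True)`. Passing a list rather than a single shell string avoids quoting problems when paths contain spaces.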

Output Files

You can also check the 'Output Files' page on SourceForge.

$ cd /out_directory/final_fusion_genes/sample-ID_or_patient-ID/

In this directory, you can find the results of SOAPfuse.

Note: the following description is for v1.25+; click here to see the old format of v1.22.

  1. sample-ID_or_patient-ID.final.Fusion.specific.for.genes
    This text file contains the predicted fusion events specific for genes. The format of this file is:
    Column_NO.     Descriptions
    1              upstream fusion gene (5' partner)
    2              chromosome of upstream fusion partner
    3              strand of upstream fusion partner
    4              genome junction position of upstream fusion partner
    5              location of upstream fusion partner's junction point
    6              downstream fusion gene (3' partner)
    7              chromosome of downstream fusion partner
    8              strand of downstream fusion partner
    9              genome junction position of downstream fusion partner
    10             location of downstream fusion partner's junction point
    11             number of span-reads (details in sample-ID.final.Span_reads)
    12             number of junc-reads (details in sample-ID.final.Junc_reads)
    13             classification of fusions
    14             whether downstream fusion partner is frame-shift or in-frame-shift
  2. sample-ID_or_patient-ID.final.Fusion.specific.for.trans
    Each gene has its own transcript isoform(s), and many genes have multiple isoforms. Since v1.24, SOAPfuse detects fusion transcripts for each transcript isoform. For example, the junction point of one fusion partner, Gene-A, may be covered by several transcript isoforms, such as A-001 and A-201, and the same may hold for its fusion partner, Gene-B. So, for one fusion gene pair (A-B), you may find several corresponding fusion transcripts, each consisting of one isoform of Gene-A and one isoform of Gene-B.
    This text file contains the predicted fusion events specific for transcript isoforms. The format of this file is:
    Column_NO.     Descriptions
    1              upstream gene (5' partner)
    2              upstream transcript isoform
    3              chromosome of upstream fusion partner
    4              strand of upstream fusion partner
    5              transcript junction position of upstream fusion partner
    6              genome junction position of upstream fusion partner
    7              location of upstream fusion partner's junction point
    8              downstream gene (3' partner)
    9              downstream transcript isoform
    10             chromosome of downstream fusion partner
    11             strand of downstream fusion partner
    12             transcript junction position of downstream fusion partner
    13             genome junction position of downstream fusion partner
    14             location of downstream fusion partner's junction point
    15             number of span-reads (details in sample-ID.final.Span_reads)
    16             number of junc-reads (details in sample-ID.final.Junc_reads)
    17             classification of fusions
    18             whether downstream fusion partner is frame-shift or in-frame-shift
    19             type of upstream fusion isoforms
    20             type of downstream fusion isoforms
    21             the area where the upstream junction point is located
    22             the area where the downstream junction point is located
    23             whether the upstream isoform has the start codon (in database)
    24             whether the upstream isoform has the stop codon (in database)
    25             whether the downstream isoform has the start codon (in database)
    26             whether the downstream isoform has the stop codon (in database)
    27             whether the fusion structure has the stop codon
    28             the information of fusion peptide chain
  3. sample-ID_or_patient-ID.final.Span_reads
    This text file contains the detailed information of span-reads.
  4. sample-ID_or_patient-ID.final.Junc_reads
    This text file contains the detailed information of junc-reads.
  5. initial_fusions
    This directory contains the initial fusion events before filtering.
  6. abandon_fusions
    This directory contains the fusion events discarded by the different filters.
  7. analysis
    This directory contains further analysis of the final fusion events.
    1. In For_RT-PCR_validation directory
      1. sample-ID_or_patient-ID.trans.fused.seq.for.RT-PCR
        This text file contains the segments on both sides of the junction point of each fusion transcript. It corresponds to the sample-ID_or_patient-ID.final.Fusion.specific.for.trans file. The upstream segment is separated from the downstream segment by a vertical line '|'. The segments help users design primers for RT-PCR validation experiments on fusion events of interest. They have proved useful and correct in several fusion research projects at BGI.
        Format:
        Column_NO.     Descriptions
        1              fusion event specific for genes
        2              fusion event specific for transcript isoforms
        3              bilateral fusion segments separated by '|'
    2. In For_peptides_analysis directory
      1. sample-ID_or_patient-ID.trans.fusion.peptide.chain
        This text file contains the predicted fusion peptide chains for all fusion transcripts annotated as 'inframe-shift' in the sample-ID_or_patient-ID.final.Fusion.specific.for.trans file. You can use the peptide chain string to study the domain regions on the iprscan website.
        Format:
        Column_NO.     Descriptions
        1              fusion event specific for transcript isoforms
        2              whole fusion segment
        3              fusion peptide chain string
    3. In figures directory
      1. fusions_expression sub-directory
        This sub-directory contains the 'SOAPfuse fusion figure' (svg/png) of each fusion event. This figure merges the old fusion figure and the gene expression figure (from v1.22), and clearly shows the transcript junction point, the genome junction point and the alignment of supporting reads. It helps users select fusion events of interest.
        Set the config parameter 'PA_s09_draw_fusion_expression_svg' to 'yes' to generate this figure.
        Click here for more information about the SOAPfuse fusion figure.
        Click here to see an example of TMPRSS2-ERG from a prostate cancer sample.
        Click here to see an example of somatic KIF5B-RET from a lung cancer patient.
        Note:
        The 'SOAPfuse fusion figure' was designed and implemented entirely by Wenlong Jia at BGI. We will apply for a patent for this figure.
      2. landscape_of_fusions sub-directory
        This sub-directory contains the 'SOAPfuse 3D landscape figure of fusion' (svg/png) of the sample/patient concerned. This figure displays all detected fusion events as a 3D pillar array. Fusion events are shown with SOAPfuse-score, Fusion-type, inframe-shift-or-not, fused-at-exon-edge-or-not and whether-is-somatic (in somatic running mode). It also helps users select fusion events of interest.
        Click here for more information about the SOAPfuse 3D landscape figure of fusion.
        Click here to see an example from a prostate cancer sample harboring TMPRSS2-ERG.
        Click here to see an example from a lung cancer patient harboring somatic KIF5B-RET.
        Note:
        The 'SOAPfuse 3D landscape figure of fusion' was also designed and implemented entirely by Wenlong Jia at BGI. Using a 3D scale to display the distribution of bioinformatic values is common, but to our knowledge this is the first time it has been used in fusion gene research.
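
The 14-column layout of the '.final.Fusion.specific.for.genes' file described above lends itself to simple downstream filtering. The Python sketch below is an illustration only: the short column names in `GENE_COLUMNS` are invented labels for the documented columns, and the file is assumed to be tab-delimited, possibly with a header line, which the description above does not state.

```python
# Hypothetical short names for the 14 documented columns, in order.
GENE_COLUMNS = (
    "up_gene", "up_chr", "up_strand", "up_genome_pos", "up_location",
    "down_gene", "down_chr", "down_strand", "down_genome_pos", "down_location",
    "span_reads", "junc_reads", "fusion_type", "frame",
)


def parse_gene_fusions(lines, delimiter="\t"):
    """Parse rows of the gene-level fusion file into dictionaries."""
    records = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) != len(GENE_COLUMNS):
            continue  # skip blank or malformed lines
        record = dict(zip(GENE_COLUMNS, fields))
        try:
            record["span_reads"] = int(record["span_reads"])
            record["junc_reads"] = int(record["junc_reads"])
        except ValueError:
            continue  # non-numeric read counts: most likely a header line
        records.append(record)
    return records


def with_min_support(records, min_span=1, min_junc=1):
    """Keep fusions supported by at least the given numbers of reads."""
    return [
        r for r in records
        if r["span_reads"] >= min_span and r["junc_reads"] >= min_junc
    ]
```

The same pattern extends to the 28-column '.final.Fusion.specific.for.trans' file by supplying a longer tuple of column names.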

Publication

SOAPfuse has been published as a Method article in Genome Biology.

  1. Jia W, Qiu K, He M, Song P, Zhou Q, Zhou F, Yu Y, Zhu D, Nickerson ML, Wan S, Liao X, Zhu X, Peng S, Li Y, Wang J, Guo G: SOAPfuse: an algorithm for identifying fusion transcripts from paired-end RNA-Seq data. Genome Biology 2013, 14:R12.

Performance Evaluation

You can also check the 'Performance Evaluation' page on SourceForge.

We evaluated six tools, including SOAPfuse (v1.22), on both real and simulated datasets. This comparison is also described in the SOAPfuse publication.

  • Released datasets

    As real datasets, we used released RNA-Seq data from two published studies, downloaded from NCBI SRA. These two studies discovered validated gene fusions from their RNA-Seq data, and the fusions are specified with Sanger sequences in their supplementary materials. One concerns melanoma and CML (dataset A, 15 fusions), and the other concerns breast cancer (dataset B, 27 fusions). All information on the validated gene fusions is based on Ensembl release 59 of hg19.

    We applied SOAPfuse (based on Ensembl release 59, hg19-GRCh37.59) to the downloaded RNA-Seq data and compared the results with those of five other published tools. The comparison is shown below.


    For dataset A, which contains ~111 million paired-end reads, SOAPfuse consumed the least CPU time (~5.2 hours) and the second least memory (~7.1 gigabytes) to complete the analysis (including the alignment of reads against the reference), and detected all 15 fusion events. DeFuse and FusionHunter detected a comparable number of known fusion events (12 to 13 of the 15 fusions), but took 82.1 and 21.3 CPU hours, respectively, at least four times as much as SOAPfuse. The computational cost of SnowShoes-FTD was comparable to that of SOAPfuse, but SnowShoes-FTD identified only eight of the 15 events. The remaining two tools, chimerascan and TopHat-Fusion, detected four confirmed fusion events but used significantly more CPU hours or memory. For dataset B, which contains ~55 million paired-end reads, SOAPfuse detected 26 of the 27 reported fusion events with 4.1 CPU hours and 6.3 gigabytes of memory. The other five tools identified fewer of the reported fusions (15 to 21) and needed at least 6.4 CPU hours. Overall, SOAPfuse showed the best performance across the three aspects compared (CPU time, memory usage and number of detected fusions).

    To get the configs (parameters) and results of all tools for the downloaded datasets, please click here.

  • Simulated datasets

    For the simulated datasets, we generated a set of paired-end reads (2 x 75 nt) based on the human transcriptome (hg19, Homo_sapiens, Ensembl release 59). We simulated 150 fusions based on certain criteria, and generated paired-end reads at 5-, 10-, 20-, 50-, 80-, 100-, 150- and 200-fold sequencing depth (to imitate different expression levels) using the short-read simulator provided by MAQ (Li et al., 2008).

    We then mixed the simulated reads of each fold with cleaned background data (BG). The BG data were downloaded from the NCBI Sequence Read Archive (SRA) under accession numbers SRR065491 and SRR066679, and were generated by the ENCODE Caltech RNA-Seq project (Birney, et al., 2007; Raney, et al., 2010). They are RNA-Seq data from embryonic stem cells, and were also used as background by FusionMap (H. Ge, et al., 2011).

    Chimerascan, FusionHunter and SnowShoes-FTD only detect cases fused at exon edges. Because not all simulated cases are of the exon-edge type, we excluded these three tools from the comparison. Several strategies were applied to achieve a fair and conservative comparison. Combining all results, 149 simulated cases (99%) were detected, and 142 (94%) were confirmed by at least two tools, indicating that our simulation is valid. To be prudent, the comparisons were based on these 142 simulated cases, each of which was confirmed by at least two algorithms. The comparison is shown below.


    As expected, FN rates decreased with increasing expression levels of fusion transcripts (a). SOAPfuse and deFuse achieved the lowest FN rates, about 5%, at fusion transcript expression levels of 30-fold or greater. TopHat-Fusion had higher FN rates, especially at low fusion transcript expression levels (5- to 20-fold). For the FP rate (b), only SOAPfuse stayed below 5% at all fusion transcript expression levels, while deFuse and TopHat-Fusion had higher FP rates at lower expression levels. SOAPfuse (v1.22) missed 3 simulated fusions that were detected by both deFuse and TopHat-Fusion (c), revealing a weakness in the analysis of homologous gene sequences and of short fusion transcripts of long genes. We have fixed this since v1.24.

    In fusion detection, a low FN rate and a low FP rate are generally hard to achieve together; nevertheless, SOAPfuse and deFuse both keep FN and FP rates low during fusion transcript identification. In summary, SOAPfuse showed the best performance, with low FN and FP rates at different expression levels of fusion transcripts.

    To get simulated datasets, click here.
    To get configs and results of all tools, click here.

  • Cell line datasets

    We also sequenced paired-end RNA-Seq reads for two bladder cancer cell lines and applied SOAPfuse to them with certain criteria. SOAPfuse identified a total of 16 fusions, all of which are intrachromosomal and fused at exon edges. We designed primers for RT-PCR validation of all predicted fusions, and Sanger sequencing confirmed 15 genuine fusion events (93% validation rate), of which 6 pairs are novel and shared by the two cell lines. Some validated fusions, supported by strong signals, may be caused by chromosomal rearrangements in the genome. Some are formed by genes on different strands, which implies potential inversions, and some are formed by same-strand genes in reversed genomic orientation. The RNA-Seq data from the two bladder cancer cell lines have been submitted to NCBI SRA and are available under accession number [SRA052960].

    References

    • Maher CA, Palanisamy N, Brenner JC, Cao X, Kalyana-Sundaram S, Luo S, Khrebtukova I, Barrette TR, Grasso C, Yu J, et al. 2009. Chimeric transcript discovery by paired-end transcriptome sequencing. Proc Natl Acad Sci 106: 12353–12358.
    • Berger MF, Levin JZ, Vijayendran K, Sivachenko A, Adiconis X, Maguire J, Johnson LA, Robinson J, Verhaak RG, Sougnez C, et al. 2010. Integrative analysis of the melanoma transcriptome. Genome Res 20: 413–427.
    • Edgren H, Murumaegi A, Kangaspeska S, Nicorici D, Hongisto V, Kleivi K, Rye IH, Nyberg S, Wolf M, Boerresen-Dale AL, et al. 2011. Identification of fusion genes in breast cancer by paired-end RNA-sequencing. Genome Biol 12: R6.