Introduction

SOAPsplice: We have developed a tool SOAPsplice for genome-wide ab initio detection of splice junction sites from RNA-Seq, a method using new generation sequencing technologies to sequence the messenger RNA. The tool performs better than the previous tools by detecting similar number of true junctions with lower false positives as the best performing tool in all different situations with different read lengths and coverage. In particular, SOAPsplice performs better by predicting more true junctions with low false positive rate when the coverage is low, which is useful for detecting the junctions for those mRNAs with relatively lower expression level. For more information, please refer to the part of "Performance Evaluation" in this page. SOAPsplice is free for academic use only.

System Requirements

  1. Hardware:
    1. 64-bit x86-64 Intel CPU with SSE instructions
    2. The program needs ~6G main memory to run with data from Homo sapiens.
  2. Software:
    1. 64-bit Linux System
    2. The version of gcc compiler is at least 4.2.4
    3. The version of perl is at least 5.8.5

Download

  1. Download the SOAPsplice from the link below.

  2. Release 1.10, 04-24-2013New!

    CHANGE:
    1. Fix some bugs which cause "Segmentation Fault"
    2. Use gzip format to generate *.out and *.2Segs result files.

    download (MD5: df038fd6657885b77932111652e9203d)

    Release 1.9, 02-23-2012

    CHANGE:
    1. Option -r (the length of reads, which is important to generate junction file) is obsolete, the program can get this value automatically

    download (MD5: 698feb18aafbf7a12be4028073d919de)

    Release 1.8, 01-11-2012

    CHANGE:
    1. Fix a crash bug
    2. Do not warn about invalid position any more

    download (MD5: 175b119c047d0c4aad0ea88516890423)

    Release 1.7, 11-10-2011

    CHANGE:
    1. Improved the upper limit of mismatch number to 5 in one segment alignment
    2. Fix a problem about missing mismatch details of some reads in ".out" files
    3. Fix a bug in SAM format, the gap (N) should minus 1 in CIGAR field

    download (MD5: c9ebd7c36a5b9e40998792ef9121ef11)

    Release 1.6, 10-17-2011

    CHANGE:
    1. Fix a bug about segmentation fault when generating the junction result
    2. Reduce memory usage when generating the junction result
    3. Correct the wrong count of alignment result in statistics
    4. Fix a bug about MD field in SAM format result
    5. Do not change the ID like "gi|5524211|gb|AAD44166.1|" to "gb|AAD44166.1|" any more in reference sequence file when building indexes

    download (MD5: c1b438b9e9dbf5092418ee2661e1232d)

    Release 1.5, 09-02-2011

    Note: The package was updated on September 3, 2011 to fix a bug when building indexes

    CHANGE:
    1. Fix a bug about segmentation fault when output SAM format result
    2. Fix a bug when the number of reference sequences is greater than 65535
    3. Fix a bug about junctions' support count
    4. Fix a bug about position translation when building indexes (please see Note below)
    5. Add CIGAR, NM, and MD fields for InDel (edit distance) in output format SAM
    6. Merge all the perl scripts into "soapsplice", and "soapsplice" can output junction result directly, related to options "-L", "-l", "-I", "-r", and "-j". For details, see help information of "soapsplice"

    download (MD5: 0db461ef6bef57c9bcddf0beb7af2c4d)

    Note:

    Some reference sequences indexes need to be rebuilt (file .index.tra is affected), or use the following Perl script to just regenerate the file .index.tra

    $ perl genPosTra.pl ref.fa > ref.fa.index.tra
    

    download (MD5: 1b463513c39d649fdf4a6aa8b4b485b8)

    Release 1.4, 07-29-2011

    CHANGE:
    1. "soapsplice" can output SAM format result now, related to options "-f", "-q", and "-s". For details, see help information of "soapsplice"

    download (MD5: b427875a7ea6c9a6dc19d52e0e1790a4)

    Release 1.3, 07-25-2011

    CHANGE:
    1. Fix a potential bug about alignment position
    2. Change some places when output SOAP format result, make it correspond with SOAP format
    3. Fix a bug when output junctions from misordered result files generated by multi-thread

    download (MD5: 118f65e78ac6f1a1bb9517d485443b29)

    Release 1.2, 07-15-2011

    CHANGE:
    1. Support parallel thread alignment, set thread number in the -p option
    2. Support gzip fastq as soapsplice input data and gzip fasta as 2bwt-builder input data
    3. Support SOAP format result as its output files instead of the orginal format in version 1.0, set result format in the -f option
    4. Add version number to the help information and running log
    5. The first parameter of soapsplice "short" is not necessary any more
    6. Remove the config file "2bwt-builder.ini"
    7. Fix a bug about corrupted characters where reporting the alignment strand of some reads
    8. Fix a bug about "Segmentation fault" message occasionally when building index is nearly over

    download (MD5: f52304605d9564956c9d7e2b5a265ad7)

    Release 1.1, 06-19-2011

    CHANGE: fix some bugs
    download (MD5: 00b21c4cfdcf50fa66c68cae17563d9f)

    Release 1.0, 01-21-2011

    download (MD5: bf7d9b367233a15bbcd8d386ee6227d8)

  3. Download pre-built indexes.

  4. Homo sapiens, UCSC hg18 (7.4 GB)

    download (MD5: c5338f09617e9b99a3e92a9a7a4fcdc5)

    Homo sapiens, UCSC hg19 (7.5 GB)

    download (MD5: 7e8f0f475fea576fe9f5c7e0f71ff335)

    Mus musculus, UCSC mm9 (7.5 GB)

    download (MD5: 7139fb1e0fc071ed8524be71653d1cc8)

    Arabidopsis thaliana, TAIR 9 (386 MB)

    download (MD5: c413b1d333c01522732344fdbcd9f7d7)

Installation

  1. Download the SOAPsplice from the link above.
  2. In the Linux terminal:
  3. $ tar -xzf /PATH_WHERE_YOU_PUT_THE_TARBALL/SOAPsplice-vX.X.tar.gz
    $ cd SOAPsplice-vX.X/
    
  4. In the directory of ~/SOAPsplice-vX.X/example, there is a script "example.sh" for program testing.

Command Line Options

To run SOAPsplice, we need to build index files for the reference genome, and seach reads against the formatted index files.

  1. Format reference sequence:
  2. <ExecutablePath>/2bwt-builder <FASTA sequence file> <Output index prefix>
    e.g.: ./2bwt-builder ~/human_genome.fa
    or ./2bwt-builder ~/human_genome.fa ~/index/human_genome.fa
    

    Note:

    The index files will be in the directory "~/index/".

    Then under the directory there will be 16 files, all their prefixes are your_fasta file name with ".index" added, e.g. human_genome.fa.index. The suffixes include *.amb, *.ann, *.bwt, *.fmv, *.hocc, *.hpac, *.ht, *.hv, *.lkt, *.pac, *.rev.bwt, *.rev.fmv, *.rev.lkt, *.rev.pac, *.sa4, *.tra.

    If you don't input "<Output index prefix>", a collection of files will be built with the filename prefix "<FASTA sequence file>.index", otherwise the prefix is "<Output index prefix >.index".

  3. Alignment:
  4. For paired-end reads:

    $ soapsplice -d <2BWT index prefix> -1 <reads_a> -2 <reads_b> -r <length of reads> -I <insert size> -o <prefix of output files> [Other Options]
    

    Options:

      -d <str>  * Prefix of reference index files
      -1 <file> * Reads file 1, FASTA/FASTQ format
      -2 <file>   Reads file 2, FASTA/FASTQ format
      -o <path> * Prefix of output files, (can be with file directory)
    
      -p <int>    Number of threads, <= 20, 1 (default)
      -S <int>    1: Forward, 2: Reverse, 3: Both (default)
    
      -m <int>    Maximum mismatch for one-segment alignment, <= 5, 3 (default)
      -g <int>    Maximum indel for one-segment alignment, <= 2, 2 (default)
      -i <int>    Length of tail that can be ignored in one-segment alignment, 7 (default)
    
      -t <int>    Longest gap between two segments in two-segment alignment, 500000 (default)
      -a <int>    Shortest length of a segment in two-segment alignment, 8 (default)
    
      -c <int>    1: Output read and its quality in one-segment alignment output file (default)
                  0: Don't output such information to save disk space
      -f <int>    Format of output files, 0: original (default), 1: SOAP, 2: SAM
    
    For Output Format SAM:
      -s <int>    Set the MAPQ (mapping quality) field to this value, just valid in SAM format, 255 (default)
      -q <int>    Input quality type in FASTQ file
                  0: quality = Phred + 64, used in Illumina/Solexa format, (default)
                  1: quality = Phred + 33, used in Sanger format
    
    For Output Junctions:
      -r <int>    The length of reads
      -L <int>    The maximum distance between paired-end reads, 500000 (default)
      -l <int>    The minimum distance between paired-end reads, 50 (default)
      -I <int>    The insert length of paired-end reads
      -j <int>    1: Output junction information (default)
                  0: Don't output junction information
    

    Note:

    Running time warning: if the input data is larger than 10G and the reads are longer than 50bp, SOAPsplice will become very time-consuming. For example, it takes SOAPsplice 109 CPU hours to finish the alignment for 18,584,414 pairs of 130bp reads from Homo sapiens.

Output Files

  • One-segment alignment result: *.out
  • Two-segment alignment result: *.2Segs
  • Junction file: *.junc

For more detailed info about these files' format description, read the document in the program package.

Performance Evaluation

We evaluate four tools based on both simulated and real datasets. For simulated datasets, we generated a set of reads (with read length = 50) based on the transcripts of human chromosome 10. We exclude three transcripts which are shorter than 200nt. In total, we have 1,296 transcripts. For each read length, we generated reads at 1-, 5-, 10-, 20- and 50-fold sequencing depth using the short-read simulator provided by Maq (Li et al., 2008).

Table 1 shows the results for read length of 50nt. In all sequencing depths except the case with 1x coverage, MapSplice reports more or similar number of true junctions as SOAPsplice, however, they give a lot more false positives, in particular when the coverage becomes higher (e.g. MapSplice produces more than 6 times false positives in the case with 50x coverage). This matches our observation that when the read length is relatively short, they will give more false positives. For the case with 1x coverage, we seem to report more true junctions with fewer false positives. For the case with 5x coverage, on the other hand, MapSplice produces more true junctions but also gives 3 times more false positives. The other two tools performed relatively worse than SOAPsplice and MapSplice.

To regenerate the results in Table 1, download the file simulate_data.tar.gz (699 MB)

Publication

References

  • Li, H., J. Ruan, et al. (2008). "Mapping short DNA sequencing reads and calling variants using mapping quality scores." Genome Res 18(11): 1851-8.
  • Wang, E. T., R. Sandberg, et al. (2008). "Alternative isoform regulation in human tissue transcriptomes." Nature 456(7221): 470-6.
  • Trapnell, C., L. Pachter, et al. (2009). "TopHat: discovering splice junctions with RNA-Seq." Bioinformatics 25(9): 1105-11.
  • Au, Kin Fai, Jiang, Hui, Lin, Lan, Xing, Yi, Wong, Wing Hung (2010). "Detection of splice junctions from paired-end RNA-seq data by SpliceMap." Nucleic Acids Research 38(14), 4570-4578.
  • Wang, Kai et al. (2010). "MapSplice: accurate mapping of RNA-seq reads for splice junction discovery." Nucleic Acids Research 38(18), e178, doi:10.1093/nar/gkq622.
To Top