Introduction

SOAPaligner/soap2 is a member of SOAP (Short Oligonucleotide Analysis Package). It is an updated version of the SOAP software for short oligonucleotide alignment. The new program offers very fast and accurate alignment of the huge volumes of short reads generated by the Illumina/Solexa Genome Analyzer. Compared to soap v1, it is one order of magnitude faster: it requires only 2 minutes to align one million single-end reads to the human reference genome. Another notable improvement is that SOAPaligner now supports a wide range of read lengths.

SOAPaligner gains its time and space efficiency from a complete redesign of the basic data structures and algorithms. The core algorithms and the indexing data structures (2way-BWT) were developed by the algorithms research group of the Department of Computer Science, the University of Hong Kong (T.W. Lam, Alan Tam, Simon Wong, Edward Wu and S.M. Yiu).

System Requirements

1. Hardware:
  a) 64-bit x86-64 CPUs with SSE instructions.
  b) 8 GB main memory (for a genome as large as the human genome).
  c) 8 GB hard disk (for a genome as large as the human genome).
2. Software:
  a) 64-bit Linux system (kernel >=2.6).
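
The requirements above can be checked from a shell before downloading. The following is a minimal sketch, assuming a Linux host that exposes /proc/cpuinfo and /proc/meminfo. The function names (kernel_at_least_2_6, kb_to_gb, check_requirements) are made up for this example and are not part of SOAPaligner.

```shell
#!/bin/sh
# Sketch: report whether this machine meets the listed requirements.
# Assumes a Linux host with /proc/cpuinfo and /proc/meminfo.

# Succeeds when a "major.minor..." kernel release string is at least 2.6.
kernel_at_least_2_6() {
    major=${1%%.*}
    rest=${1#*.}
    minor=${rest%%.*}
    minor=${minor%%[!0-9]*}   # drop suffixes such as "-rc1"
    [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "${minor:-0}" -ge 6 ]; }
}

# Converts a MemTotal value in kB to whole GB (integer division).
kb_to_gb() {
    echo $(( $1 / 1024 / 1024 ))
}

# Prints one line per requirement; it only reports and never aborts.
check_requirements() {
    if [ "$(uname -m)" = "x86_64" ]; then echo "CPU: x86_64 OK"; else echo "CPU: not x86_64"; fi
    if grep -qw sse /proc/cpuinfo 2>/dev/null; then echo "SSE: OK"; else echo "SSE: not detected"; fi
    if kernel_at_least_2_6 "$(uname -r)"; then echo "Kernel: OK"; else echo "Kernel: older than 2.6"; fi
    mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo 2>/dev/null)
    echo "RAM: $(kb_to_gb "${mem_kb:-0}") GB (8 GB is listed for a human-sized genome)"
}

check_requirements
```

The reported RAM figure is rounded down, so a machine sold as "8 GB" may show 7 GB here because the kernel reserves part of the memory.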

Download

NOTE: Because of copyright restrictions on parts of the source code, we cannot release the source code of the current version of SOAPaligner/soap2. If you want to use SOAPaligner/soap2 on another platform, please feel free to contact us and tell us your CPU architecture and OS kernel version. Because the data structures are incompatible with 32-bit systems, we will NOT provide a 32-bit version.

Release 2.21 , 02-14-2011 New!

For GNU Linux X86_64 : download ( MD5: 563b8b7235463b68413f9e841aa40779 )

Release 2.20 , 08-13-2009

For GNU Linux X86_64 : download ( MD5: f9dc6fddbb2087959221447062c7ec6c )
SOAPaligner-v2.20-src : download (MD5: ca75753697b12749c42356f366738fef ) under (GNU/GPL v3) New!
SOAPaligner_builder    : download (MD5: e130bf9d50d0b82604cba591c0f92796) Source with index builder.
For MAC OS X              : download (MD5: 00134fe0bdf1c7ab1109b99a4cc09340 )
Compile environment:
System Version: Mac OS X 10.6.3 (10D573); Kernel Version: Darwin 10.3.0 x86_64 ; gcc version 4.2.1

New utilities for SOAP:
1 soap2sam.pl : a format converter.
2 soap.coverage : calculates sequencing coverage or physical coverage, as well as duplication rate and details of specific blocks, for each segment and for the whole genome. It reads SOAP, BLAT, BLAST, BlastZ, MUMmer and MAQ alignment results and supports multithreading.
soap.coverage : version 2.7.7 Download (MD5:7cf98626e3573d680ed0e767207bfa95)

Release 2.19 , 07-13-2009

For GNU Linux X86_64 : download ( MD5: f72210a472d3341c80c6c7aa0abecdf1 )
NOTE:
Here is an additional build of SOAPaligner v2.19 that supports gzip I/O: SOAPaligner-v2.19-gz.tar.gz (MD5: 6f8b3503a990cc00e45c3bdb8eff5985 )

Release 2.18 , 05-25-2009

CHANGE:
1. Fix segmentation fault in gap alignment and multithreading
2. Fix bug where some start positions were < 0 or > chrLength
3. -l option is now compatible with reads of different lengths
4. -s: min_length after soft clipping
5. seq and quality lengths are adjusted to match the soft clip
6. MD contains no 0 except the first

For GNU Linux X86_64 : download ( MD5: 36b24eb23aadde0d6dbed238cf5e58be )

Release 2.17 , 04-03-2009

For GNU Linux X86_64 : download ( MD5: 3fc5fc80a90ef92a6db9644a452b4522 )

Release 2.16 , 03-31-2009

CHANGE: Fix segmentation fault when using -r 2
For GNU Linux X86_64 : download ( MD5: f6fdb463aa5b1d315625976de71540f4 )

Release 2.15 , 03-27-2009

CHANGE: Fix bugs in gap alignment.
For GNU Linux X86_64 : download ( MD5: 10ee28d3a00cb87fa131080f5b2e7232 )

Release 2.11 , 03-17-2009

CHANGE: Fix bugs.
For GNU Linux X86_64 : download ( MD5: 5bfbc46584a56c3178499d0e45c8999c )

Thanks to all the users who tested the program and reported bugs, especially Shawn Cokus and David Casero Diaz-Cano at UCLA, Heng Li at the Sanger Institute and Junjie Qin at BGI Shenzhen.

Release 2.10 , 03-03-2009

CHANGE:
1. Allow more than 2 mismatches at 3'-ends when aligning long reads (>35 bp);
2. Add multithreading.
For GNU Linux X86_64 : download ( MD5: e5a984d62054a5c256efcab79e958a7f )

Release 2.01 , 11-24-2008

CHANGE: Fix some bugs.
For GNU Linux X86_64 : download ( MD5: a78aa68373ae04525c5122b4b16e60d8 )

Release 2.01-Beta , 11-17-2008

For GNU Linux X86_64 : download ( MD5: 97c3a7902bfd1340aea12e0638933095 )

Release 2.00 , 11-13-2008

For GNU Linux X86_64 : download ( MD5: 37d7a2751fbe8c097abedf364a599f39 )

For Mac OS X (64-bit): download


NOTE :
1. New! We now offer a sort tool (named "msort") for SOAPaligner: msort.tar.gz
2. All of the Linux releases above were built on SUSE 11 64-bit with a 2.6 kernel.
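
The MD5 values listed with each release can be used to confirm that a download is intact. The following is a minimal sketch, assuming md5sum (GNU coreutils) or, on Mac OS X, the md5 command is available. The helper names md5_of and verify_md5 are hypothetical, and the tarball name in the example is a placeholder for whatever file you actually downloaded.

```shell
#!/bin/sh
# Sketch: compare a downloaded file's MD5 digest with the published value.

# Prints the MD5 digest of FILE, using md5sum if present, else BSD/Mac md5.
md5_of() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum "$1" | awk '{print $1}'
    else
        md5 -q "$1"
    fi
}

# verify_md5 FILE EXPECTED_MD5
# Returns 0 when the digests match, 1 otherwise.
verify_md5() {
    actual=$(md5_of "$1")
    if [ "$actual" = "$2" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1 (got $actual, expected $2)" >&2
        return 1
    fi
}

# Example (placeholder file name, MD5 of release 2.21 from the list above):
# verify_md5 SOAPaligner.tar.gz 563b8b7235463b68413f9e841aa40779
```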

Installation

  • Download SOAPaligner from the links above.
  • In the Linux console, type:
    cd <TheDirectoryYouPutTheTarball>
    tar zxvf SOAPaligner.tar.gz
    cd SOAPaligner
  • The directory now contains 2 executable files, 2bwt-builder and soap.

Command Line Options

To run SOAPaligner, first build index files for the reference genome, then align the reads against those index files.

1. Format reference sequence:

<ExecutablePath>/2bwt-builder <FastaPath/YourFasta>
e.g.: ./2bwt-builder ~/human_genome.fa

The directory will then contain 13 index files. They all share a prefix made of your FASTA file name with “.index” appended, e.g. human_genome.fa.index. The suffixes are *.amb, *.ann, *.bwt, *.fmv, *.hot, *.lkt, *.pac, *.rev.bwt, *.rev.fmv, *.rev.lkt, *.rev.pac, *.sa, and *.sai.
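
Because index building is slow, it can be worth confirming that all 13 files are present before starting an alignment. The following is a minimal sketch; check_index is a hypothetical helper (not part of SOAPaligner) and the suffix list is copied from the paragraph above.

```shell
#!/bin/sh
# Sketch: confirm that every index file exists for a given prefix.

INDEX_SUFFIXES="amb ann bwt fmv hot lkt pac rev.bwt rev.fmv rev.lkt rev.pac sa sai"

# check_index PREFIX   (e.g. ~/human_genome.fa.index)
# Prints each missing file to stderr and returns non-zero if any are absent.
check_index() {
    prefix=$1
    missing=0
    for suffix in $INDEX_SUFFIXES; do
        if [ ! -f "$prefix.$suffix" ]; then
            echo "missing: $prefix.$suffix" >&2
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Example:
# check_index ~/human_genome.fa.index && echo "index complete"
```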

2. Alignment quick start:

For alignment of single-end reads:

./soap -a <reads_a> -D <index.files> -o <output>

For paired-end reads:

./soap -a <reads_a> -b <reads_b> -D <index.files> -o <PE_output> -2 <SE_output> -m <min_insert_size> -x <max_insert_size>

NOTE: The -D option accepts only the prefix of your index files, such as “~/human_genome.fa.index”.
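
The two invocations above can be wrapped in small shell functions so that the flags are written once. This is only a sketch: SOAP_BIN, soap_se and soap_pe are names invented for this example, and the file names in the usage comment are placeholders. Only the flags themselves (-a, -b, -D, -o, -2, -m, -x) and the insert-size defaults of 400 and 600 come from this page.

```shell
#!/bin/sh
# Sketch: thin wrappers around the single-end and paired-end commands.

# Path to the soap executable; override it in the environment if needed.
SOAP_BIN=${SOAP_BIN:-./soap}

# soap_se READS INDEX_PREFIX OUTPUT
soap_se() {
    "$SOAP_BIN" -a "$1" -D "$2" -o "$3"
}

# soap_pe READS_A READS_B INDEX_PREFIX PE_OUTPUT SE_OUTPUT [MIN_INSERT] [MAX_INSERT]
# Insert-size limits fall back to the defaults listed for -m and -x.
soap_pe() {
    "$SOAP_BIN" -a "$1" -b "$2" -D "$3" -o "$4" -2 "$5" \
        -m "${6:-400}" -x "${7:-600}"
}

# Example (placeholder file names):
# soap_pe reads_1.fq reads_2.fq ~/human_genome.fa.index pe.soap se.soap 400 600
```

Setting SOAP_BIN=echo prints the arguments instead of running the aligner, which is a cheap way to inspect the command before a long run.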

3. Options:

-D   STR   Prefix name for reference index [*.index].
-a   STR   Query file, for SE reads alignment or one end of PE reads
-b   STR   Query b file, the other end of PE reads
-o   STR   Output file for alignment results
-2   STR   Output file for reads that are mapped but unpaired in PE alignment
-u   STR   Output file for unmapped reads, [none]
-m   INT   Minimal insert size INT allowed for PE, [400]
-x   INT   Maximal insert size INT allowed for PE, [600]
-n   INT   Filter out low-quality reads containing more than INT bp Ns, [5]
-t     Output read IDs instead of read names, [none]
-r   INT   How to report repeat hits, 0=none; 1=random one; 2=all, [1]
-R     RF alignment for PE data with long insert size (>= 2k bps), [none] FR alignment
-l   INT   For long reads with a high error rate at the 3'-end that
       can't be aligned over their whole length: first align the
       5' INT bp subsequence as a seed, [256] use whole length of the read
-v   INT   Total number of mismatches allowed in one read, [2]
-M   INT   Match mode for each read or the seed part of a read, which
       shouldn't contain more than 2 mismatches, [4]
       0: exact match only
       1: 1 mismatch match only
       2: 2 mismatch match only
       3: [gap] (coming soon)
       4: find the best hits
-p   INT   Multithreading, number of threads, [1]
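
As an illustration of how several of these options combine, the sketch below aligns long single-end reads using a 5' seed (-l), a higher mismatch budget (-v), a repeat-reporting mode (-r) and several threads (-p). The wrapper name soap_long_se, the SOAP_BIN variable and the values in the usage comment are assumptions chosen for the example, not recommended settings. The check on -r reflects only the three values documented above.

```shell
#!/bin/sh
# Sketch: single-end alignment of long reads with seed, mismatch,
# repeat-mode and thread options taken from the option list.

SOAP_BIN=${SOAP_BIN:-./soap}

# soap_long_se READS INDEX_PREFIX OUTPUT UNMAPPED SEED_LEN MISMATCHES REPEAT_MODE THREADS
soap_long_se() {
    # -r is documented as 0 (none), 1 (random one) or 2 (all).
    case $7 in
        0|1|2) ;;
        *) echo "-r must be 0, 1 or 2 (got $7)" >&2; return 2 ;;
    esac
    "$SOAP_BIN" -a "$1" -D "$2" -o "$3" -u "$4" \
        -l "$5" -v "$6" -r "$7" -p "$8"
}

# Example (placeholder file names and illustrative values):
# soap_long_se reads.fq ~/human_genome.fa.index out.soap unmapped.fa 32 5 1 4
```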
			

Evaluation

SOAPaligner needs about 2 hours to format the reference sequence and build the index tables. RAM usage depends on the total size of the reference sequence; for the human reference genome, it is about 7 GB.

Table 1. Performance of aligning 1 million single-end reads (35bp read length) or 1 million read pairs onto the human reference genome

Program              Time (sec), single-end reads   Time (sec), paired-end reads   RAM (GB)
SOAPaligner (soap2)  120                            505                            6.8
soap                 1700+                          5743                           13.4
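
The ratios implied by Table 1 can be worked out directly. The sketch below divides the soap v1 figures by the soap2 figures. The helper name ratio is made up for this example, and since the single-end soap v1 time is given as "1700+", that ratio is a lower bound.

```shell
#!/bin/sh
# Worked arithmetic for Table 1: soap v1 figures divided by soap2 figures.

# ratio NUMERATOR DENOMINATOR -> quotient rounded to one decimal place
ratio() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

echo "Single-end speedup: at least $(ratio 1700 120)x"
echo "Paired-end speedup: $(ratio 5743 505)x"
echo "RAM reduction:      $(ratio 13.4 6.8)x"
```

The time ratios (about 14x and 11x) match the "one order of magnitude faster" statement in the introduction, while RAM usage is roughly halved.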

Future Development

  • Binary soap alignment output, and .gz input and output.

Acknowledgements

We thank Prof. T.W. Lam, Alan Tam, Simon Wong, Edward Wu and S.M. Yiu for their outstanding work on 2way-BWT.


