Introduction

SOAP3 is GPU-based software for aligning short reads with a reference sequence. It can find all alignments with up to k mismatches, where k is chosen from 0 to 3 (see Section 3.2 for other options including finding only the best alignments and trimming the reads). Compared with its predecessor SOAP2, SOAP3 can be up to tens of times faster. For example, when aligning length-100 reads with the human genome, SOAP3 is the first software that can find all 3-mismatch alignments in tens of seconds per one million reads.
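To make the k-mismatch terminology concrete, here is a minimal brute-force sketch in Python. It is not how SOAP3 works internally (SOAP3 uses the GPU-2BWT index described below); it only illustrates what "all alignments with up to k mismatches" and "best alignments" mean. The function names are hypothetical, and only the forward strand is considered.

```python
def find_alignments(read, reference, k):
    """Return (position, mismatches) for every placement of `read`
    on `reference` that has at most k mismatches (forward strand only)."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = 0
        for a, b in zip(read, window):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break  # too many mismatches, give up on this position
        if mismatches <= k:
            hits.append((pos, mismatches))
    return hits


def best_alignments(hits):
    """Keep only the hits with the fewest mismatches."""
    if not hits:
        return []
    fewest = min(m for _, m in hits)
    return [h for h in hits if h[1] == fewest]


if __name__ == "__main__":
    hits = find_alignments("ACGT", "ACGTACCT", k=1)
    print(hits)
    print(best_alignments(hits))
```

A scan like this takes time proportional to the reference length for every read, which is why an index is needed for a genome-sized reference.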

The alignment program in this package is optimized to process several million short reads per run, using a multi-core CPU and the GPU concurrently.

To exploit the parallelism of the GPU effectively, SOAP3 uses an adapted version of the 2BWT index of SOAP2 (the new index is called the GPU-2BWT). The index and algorithms were developed by the algorithms research group of the University of Hong Kong (T.W. Lam, C.M. Liu, Thomas Wong, Edward Wu and S.M. Yiu).

System Requirements

1. Hardware requirement:
To run SOAP3, you need a Linux workstation equipped with
  (i) a dual-core or quad-core CPU with at least 16GB main memory, and
  (ii) a CUDA-enabled GPU with at least 3GB memory.

soap3_aligner has been tested with the following GPUs: NVIDIA Tesla C2070 (6GB memory) and Tesla M2050 (3GB memory).

When using a GPU with 3GB memory, such as the Tesla M2050, you should disable the ECC function to free more memory for soap3_aligner.

2. Platform:
  SOAP3 was developed on the 64-bit Linux platform with CUDA Driver version 3.2.

3. Usage:
SOAP3 consists of 3 parts:
  (i) Index (2BWT and GPU-2BWT) builder;
  (ii) Aligner;
  (iii) Output viewer.

3.1 Index builder
The index builder preprocesses the FASTA reference sequence and generates the index files needed by soap3_aligner. Note the following restrictions on the input:

  (i) The reference sequence can contain at most 4 billion characters.
  (ii) All characters other than A, C, G, T will be replaced by character G. Any sequence of more than 10 consecutive invalid characters will be removed.
  (iii) No more than 256 sequences (chromosomes) in a single FASTA file.
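
Before building the index, it can help to check a reference file against these restrictions. The following Python sketch is not part of the SOAP3 package; the function names are hypothetical. It assumes that validity is case-sensitive (only uppercase A, C, G, T count as valid) and that the 4-billion limit applies to the raw character count, neither of which is stated above.

```python
import re

MAX_REFERENCE_CHARS = 4_000_000_000  # restriction (i)
MAX_INVALID_RUN = 10                 # restriction (ii)
MAX_SEQUENCES = 256                  # restriction (iii)


def normalize_sequence(seq):
    """Mimic restriction (ii): runs of more than 10 invalid characters
    are removed; shorter runs are replaced character by character with G.
    Assumption: only uppercase A, C, G, T are treated as valid."""
    def fix(match):
        run = match.group(0)
        return "" if len(run) > MAX_INVALID_RUN else "G" * len(run)
    return re.sub(r"[^ACGT]+", fix, seq)


def check_reference(lines):
    """Count sequences and sequence characters in FASTA lines and
    report violations of restrictions (i) and (iii)."""
    num_sequences = 0
    num_chars = 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            num_sequences += 1      # header line starts a new sequence
        else:
            num_chars += len(line)  # raw count, before any normalization
    problems = []
    if num_chars > MAX_REFERENCE_CHARS:
        problems.append("reference has more than 4 billion characters")
    if num_sequences > MAX_SEQUENCES:
        problems.append("more than 256 sequences in one FASTA file")
    return num_sequences, num_chars, problems


if __name__ == "__main__":
    print(normalize_sequence("ACNNGT"))
    print(check_reference([">chr1", "ACGT", "AC", ">chr2", "GG"]))
```

To check a real file, pass an open file object, for example `check_reference(open("genome.fa"))`.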

Step 1: Build the 2BWT index.

Syntax:

% ./2bwt-builder

For example:

% ./2bwt-builder genome.fa

A number of files with the filename-prefix ".index" will be generated.

Step 2: Convert the 2BWT index to the GPU-2BWT index.

Syntax:

% ./BGS-Build .index

For example:

% ./BGS-Build genome.fa.index

Additional files with the filename-prefix ".index" will be generated.

With all these index files, you are now ready to use soap3_aligner to perform alignment for the reads.

3.2 Aligner
Syntax:

% ./soap3_aligner <# of reads in query file> [options]

options: -m (from 0 to 3, default: 3)
              -h (1: all valid alignments; 2: all best alignments, default: 1)
              -t (to enable trimming of unaligned reads and then re-alignment)
              -l ( default: 20 )
              -n ( default: 2 )

The maximum read length is normally in the range [75, 200]. When the read length is below 100, it is advised to use a GPU with 6GB memory.

Example 1: The file query.fa contains one million length-100 reads, and the following command aligns the reads with default options (up to 3 mismatches, reporting all valid alignments).

% ./soap3_aligner genome.fa.index query.fa 1000000 100

Example 2: Suppose that the maximum number of mismatches allowed is 2, and only the best alignments (with the fewest mismatches) are needed.

% ./soap3_aligner genome.fa.index query.fa 1000000 100 -m 2 -h 2

Example 3: Reads that cannot be aligned are further aligned with the last 25 characters trimmed.

% ./soap3_aligner genome.fa.index query.fa 1000000 100 -t -l 25
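
The examples above pass the number of reads and a second number (100) by hand. The following Python sketch, which is not part of the SOAP3 package, counts the reads in a query file and assembles the same argument list. It assumes that the query file is in FASTA format with one ">" header per read, and that the fourth argument is the maximum read length; the latter is inferred from the examples rather than stated in the syntax line. The function names are hypothetical.

```python
def summarize_reads(lines):
    """Return (number of reads, maximum read length) for FASTA lines."""
    num_reads = 0
    max_len = 0
    current = 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            max_len = max(max_len, current)  # close the previous read
            current = 0
            num_reads += 1
        else:
            current += len(line)             # reads may span several lines
    max_len = max(max_len, current)          # close the last read
    return num_reads, max_len


def build_command(index_prefix, query_file, num_reads, max_len, options=()):
    """Assemble the argument list in the order used by the examples."""
    return ["./soap3_aligner", index_prefix, query_file,
            str(num_reads), str(max_len), *options]


if __name__ == "__main__":
    reads, longest = summarize_reads([">r1", "ACGT", ">r2", "ACG", "TTA"])
    print(" ".join(build_command("genome.fa.index", "query.fa",
                                 reads, longest, ["-m", "2", "-h", "2"])))
```

The resulting list can be passed to `subprocess.run` if the aligner is to be launched from a script.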

soap3_aligner produces two or more output files, depending on the number of threads being used. The results of aligning the original reads are stored in the
files with filename-prefix .gout, and those of the trimmed reads in the files with filename-prefix .gout.trim. Note that these
files are in binary format (see SOAP2). If you are familiar with this binary format, you may manipulate the data directly. Otherwise, you can use the
following viewer to convert them to plain text format.

3.3 Output viewer

As mentioned above, soap3_aligner outputs in binary format. The program make_view.sh converts the binary format into plain text form and merges all the output files into one.

Syntax:

% ./make_view.sh

For example:

% ./make_view.sh query.fa

It will output one file named .out, storing the results of aligning all the original reads in plain text format.
If the trimmed read option is chosen, another output file named .trim.out is produced storing the results of aligning the trimmed reads.

3.4 Multi-threading

By default, soap3_aligner uses 2 CPU threads. To change the number of CPU threads (to 1, 3, or 4), modify the parameter "NumOfCpuThreads" inside the file
"soap3_aligner.ini".
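
If the thread count needs to be changed from a script, something like the following Python sketch could be used. It is not part of the SOAP3 package. It assumes that the parameter appears in soap3_aligner.ini as a line of the form NumOfCpuThreads=2; the exact file layout is not described here, so check the file before relying on it. The function name is hypothetical.

```python
import re


def set_cpu_threads(ini_text, threads):
    """Return ini_text with NumOfCpuThreads set to `threads`.
    Assumption: the setting is written as 'NumOfCpuThreads=<number>'."""
    if threads not in (1, 2, 3, 4):
        raise ValueError("the number of CPU threads must be 1, 2, 3 or 4")
    pattern = re.compile(r"^([ \t]*NumOfCpuThreads[ \t]*=[ \t]*)\d+",
                         re.MULTILINE)
    new_text, count = pattern.subn(lambda m: m.group(1) + str(threads),
                                   ini_text)
    if count == 0:
        raise KeyError("NumOfCpuThreads not found in the given text")
    return new_text


if __name__ == "__main__":
    print(set_cpu_threads("NumOfCpuThreads=2\n", 4))
```

Reading the file, calling the function, and writing the result back leaves every other line of the file unchanged.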

4. Reference and contact

SOAP3: GPU-based Compressed Indexing and Ultra-fast Parallel Alignment of Short Reads.

Download

Version 0.01 beta

download ( MD5: 59ea3267e073924cb9920bd7a52546bb )