Introduction

SOAPdenovo is a novel short-read assembly method that can build a de novo draft assembly for human-sized genomes. The program is specially designed to assemble Illumina GA short reads. It creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost-effective way.

System requirements

SOAPdenovo is aimed at large plant and animal genomes, although it also works well on bacterial and fungal genomes. It runs on 64-bit Linux systems with a minimum of 5 GB of physical memory. For large genomes such as human, about 150 GB of memory is required.

Download

Release 1.04 , 21-12-2009 NEW !

CHANGE: Improved gap filling module.
Download SOAPdenovo ( MD5: 016620ea7012ab57c6a580a11b96f716 )
Note:

1 GapCloser, a tool for SOAPdenovo, was released on 2010-01-25: Download | Know more

2 Only a precompiled binary version is available for now.

Release 1.03 , 06-09-2009

CHANGE: Fixed some bugs.
Download SOAPdenovo ( MD5: ca888397c74fd095fdee59ea187646cb )

Release 1.02 , 06-02-2009

CHANGE: Fixed some bugs.
Download SOAPdenovo ( MD5: 1f301cfe6fcc1d23b1a005f38d2ffebb )

Release 1.01 , 05-25-2009

CHANGE: Fixed some bugs.
Download SOAPdenovo ( MD5: 989d666fb1a32e97a80ea1525193fb34 )

Release 1.0 , 05-11-2009

Download SOAPdenovo ( MD5: f8999538f7b3704e80d31a05de1a5407 )
Note:

1 A correction tool for SOAPdenovo was released on 2009-07-03: Download | Know more

2 Only a precompiled binary version is available for now.

Installation

  1. Download the SOAPdenovo tar package;
  2. Unpack it;
  3. The package contains one executable file, "soapdenovo", and one demo configuration file, "example.config".

Command Line Options

1. Configuration file

For large genome projects with deep sequencing, the data is usually organized as multiple read sequence files generated from multiple libraries, so you have to tell the program where to find the input data. "example.config" demonstrates how to organize this information into a configuration file.

The configuration file has a section of global information, followed by multiple library sections. Information about a library and the sequencing data generated from it should be placed in the corresponding library section. At present the global section contains only the maximal read length. Each library section starts with the tag [LIB] and is followed by the read file names with their paths, the read file format, the average insert size, the library rank and two other flags that tell the assembler how to treat these reads.

The assembler accepts read files in two formats: FASTA or FASTQ. The mate-pair relationship can be indicated in two ways: two sequence files in which reads at the same position form a pair, or a single file (FASTA only) in which two adjacent reads form a pair.
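
To illustrate the single-file layout, the sketch below groups adjacent FASTA records into pairs. It is only an illustration of the convention described above, not the assembler's own reader; the function names and the read names used in the usage note are assumptions.

```python
def read_fasta(lines):
    """Yield (name, sequence) records from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def adjacent_pairs(records):
    """Group consecutive records two at a time: (read 1, read 2), ..."""
    records = list(records)
    if len(records) % 2:
        raise ValueError("odd number of reads; the last read has no mate")
    return [(records[i], records[i + 1]) for i in range(0, len(records), 2)]
```

Given a file whose records appear in the order r1/1, r1/2, r2/1, r2/2 (hypothetical names), adjacent_pairs(read_fasta(handle)) returns two pairs, one per fragment.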

Libraries are used for scaffolding in the order indicated by "rank"; libraries with the same rank are used at the same time.

The flag "asm_flags" has three eligible values: 1 (reads used only for contig assembly), 2 (reads used only for scaffold assembly) and 3 (reads used for both contig and scaffold assembly).

There are two types of paired-end libraries: a) forward-reverse, generated from the ends of fragmented DNA, with a typical insert size of less than 500 bp; b) forward-forward, generated from circularized libraries, with a typical insert size greater than 2 Kb. Users should set the "reverse_seq" tag to indicate the type: 0 for forward-reverse, 1 for forward-forward.
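
The layout described in this section can be checked before a long run. The sketch below reads a configuration text into a global section and a list of [LIB] sections, then checks the two flags against the values listed above. It is a simplified reader written for illustration, not the parser used by SOAPdenovo: the function names are assumptions, and a key repeated within one section keeps only its last value.

```python
def parse_config(text):
    """Split config text into (global settings, list of library sections)."""
    global_settings, libraries = {}, []
    current = global_settings
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "[LIB]":
            current = {}
            libraries.append(current)
            continue
        key, _, value = line.partition("=")
        current[key.strip()] = value.strip()  # a repeated key keeps the last value
    return global_settings, libraries


def check_library(lib):
    """Return a list of problems found in one library section."""
    problems = []
    if lib.get("asm_flags") not in {"1", "2", "3"}:
        problems.append("asm_flags must be 1, 2 or 3")
    if lib.get("reverse_seq") not in {"0", "1"}:
        problems.append("reverse_seq must be 0 or 1")
    return problems
```

Calling check_library on each section returned by parse_config gives an empty list when both flags hold one of the documented values.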

2. Get it started

Once the configuration file is available, the simplest way to run the assembler is:

./soapdenovo all -s config_file -o output_prefix

Users can also run the assembly process step by step:

./soapdenovo pregraph -s config_file -o output_prefix
./soapdenovo contig -g output_prefix
./soapdenovo map -s config_file -g output_prefix
./soapdenovo scaff -g output_prefix
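
When the four steps are driven from a script, it helps to stop as soon as one of them fails, since each step reads the files written by the previous one. The sketch below builds the same four command lines shown above and runs them in order. The helper names are illustrative, and the assumption that a failed step returns a non-zero exit code is not stated in this document.

```python
import subprocess


def stepwise_commands(config_file, output_prefix, binary="./soapdenovo"):
    """Return the four step commands as argument lists, in run order."""
    return [
        [binary, "pregraph", "-s", config_file, "-o", output_prefix],
        [binary, "contig", "-g", output_prefix],
        [binary, "map", "-s", config_file, "-g", output_prefix],
        [binary, "scaff", "-g", output_prefix],
    ]


def run_steps(commands, runner=subprocess.call):
    """Run commands in order; return the first failing command, or None."""
    for argv in commands:
        # Assumption: a non-zero exit code means the step failed
        if runner(argv) != 0:
            return argv
    return None
```

With the default runner, run_steps(stepwise_commands("config_file", "output_prefix")) executes the binary in the current directory; passing a different runner allows a dry run that only prints the commands.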

3. Options:

	-s	STR	configuration file
	-o	STR	output graph file prefix
	-g	STR	input graph file prefix
	-K	INT	K-mer size [default 23]
	-p	INT	number of threads [default 8]
	-R		use reads to solve tiny repeats [default no]
	-d		remove low-frequency K-mers with single occurrence [default no]
	-D		remove edges composed entirely of single-frequency K-mers [default no]
	-F		intra-scaffold gap closure [default no]
	-L		minimum contig length used for scaffolding

4. Output files

These files are output as assembly results:

*.contig contig sequence file
*.scafSeq scaffold sequence file

There are some other files that provide useful information for advanced users.

FAQ

1. How to set K-mer size?

The program accepts odd numbers between 13 and 31. Larger K-mers are more likely to be unique in the genome and make the graph simpler, but they require greater sequencing depth and longer reads to guarantee overlap at every genomic location.
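
The trade-off can be made concrete with a little arithmetic: a read of length L contains L - K + 1 K-mers, so a larger K leaves fewer K-mers per read to support each overlap. The sketch below checks a K value and computes that count. It assumes both bounds, 13 and 31, are inclusive, which the text above does not state explicitly; the function names are illustrative.

```python
def is_valid_kmer_size(k):
    """True for odd integers from 13 to 31 (bounds assumed inclusive)."""
    return isinstance(k, int) and 13 <= k <= 31 and k % 2 == 1


def kmers_per_read(read_length, k):
    """Number of K-mers contained in one read: read_length - k + 1, never negative."""
    return max(0, read_length - k + 1)


# All K values accepted under the inclusive-bounds assumption
VALID_KMER_SIZES = [k for k in range(13, 32) if is_valid_kmer_size(k)]
```

For 50-bp reads, the default K of 23 gives 28 K-mers per read, while K = 31 gives 20.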

2. How to set library rank?

SOAPdenovo uses the paired-end libraries in order of insert size, from smaller to larger, to construct scaffolds. Libraries with the same rank are used at the same time. For example, in a human genome dataset, we set five ranks for five libraries with insert sizes of 200 bp, 500 bp, 2 Kb, 5 Kb and 10 Kb, respectively. Ideally, the pairs in each rank provide adequate physical coverage of the genome.
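
One simple way to follow this rule is to give each distinct insert size its own rank, counting up from the smallest. The sketch below does that. It is only one possible scheme, written for illustration: it assumes one rank per distinct insert size, and libraries with similar but unequal insert sizes would need to be grouped by hand if they should share a rank.

```python
def assign_ranks(insert_sizes):
    """Map each distinct insert size (in bp) to a rank, smallest insert first."""
    ordered = sorted(set(insert_sizes))
    return {size: rank for rank, size in enumerate(ordered, start=1)}
```

For the five human libraries in the example above, assign_ranks([200, 500, 2000, 5000, 10000]) maps 200 to rank 1 and 10000 to rank 5.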


APPENDIX: example.config
#maximal read length
max_rd_len=50
[LIB]
#average insert size
avg_ins=200
#if sequence needs to be reversed 
reverse_seq=0
#in which part(s) the reads are used
asm_flags=3
#in which order the reads are used while scaffolding
rank=1
#fastq file for read 1 
q1=/path/**LIBNAMEA**/fastq_read_1.fq
#fastq file for read 2 always follows fastq file for read 1
q2=/path/**LIBNAMEA**/fastq_read_2.fq
#fasta file for read 1 
f1=/path/**LIBNAMEA**/fasta_read_1.fa
#fasta file for read 2 always follows fasta file for read 1
f2=/path/**LIBNAMEA**/fasta_read_2.fa
#fastq file for single reads
q=/path/**LIBNAMEA**/fastq_read_single.fq
#fasta file for single reads
f=/path/**LIBNAMEA**/fasta_read_single.fa
#a single fasta file for paired reads
p=/path/**LIBNAMEA**/pairs_in_one_file.fa
[LIB]
avg_ins=2000
reverse_seq=1
asm_flags=2
rank=2
q1=/path/**LIBNAMEB**/fastq_read_1.fq
q2=/path/**LIBNAMEB**/fastq_read_2.fq
q=/path/**LIBNAMEB**/fastq_read_single.fq
f=/path/**LIBNAMEB**/fasta_read_single.fa
			

