Introduction

SOAPsnp is a member of SOAP (Short Oligonucleotide Analysis Package). Despite its name, the program is a resequencing utility that assembles a consensus sequence for the genome of a newly sequenced individual, based on the alignment of the raw sequencing reads to a known reference. SNPs can then be identified on the consensus sequence by comparison with the reference. In the first Asian genome resequencing project, evaluation of the SOAPsnp results at Illumina HapMap 1M BeadChip Duo genotyping sites showed high accuracy: over 99% of the genotyping sites were covered, with over 99.9% consistency. Further PCR plus Sanger sequencing of the inconsistent SNP sites confirmed the majority of the SOAPsnp results.

SOAPsnp uses a method based on Bayes’ theorem (the inverse probability model) to call the consensus genotype by carefully considering data quality, alignment, and recurring experimental errors. All of this information is integrated into a single PHRED-scaled quality score for each base to measure the accuracy of consensus calling. Currently, it supports the alignment format of SOAPaligner.
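
As a rough illustration of the idea (not SOAPsnp's actual model), the following Python sketch applies Bayes' theorem to three genotype classes at one site and converts the probability that the best call is wrong into a PHRED-scaled score. The likelihood values, the function names, and the score cap of 99 are assumptions made for the example; the two prior values are the documented defaults of the -r and -e options.

```python
import math

def genotype_posteriors(likelihoods, priors):
    """Bayes' theorem: posterior is likelihood times prior, normalised over genotypes."""
    joint = {g: likelihoods[g] * priors[g] for g in likelihoods}
    total = sum(joint.values())
    return {g: p / total for g, p in joint.items()}

def phred(error_probability, cap=99):
    """PHRED scale: Q = -10 * log10(P_error). The cap of 99 is an assumption."""
    if error_probability <= 0:
        return cap
    return min(cap, int(round(-10 * math.log10(error_probability))))

# Hypothetical likelihoods P(observed reads | genotype) at a single site.
likelihoods = {"refHOM": 1e-6, "HET": 1e-2, "altHOM": 1e-5}

# Priors for a novel HET and a novel altHOM site; refHOM takes the remainder.
priors = {"HET": 0.0010, "altHOM": 0.0005}
priors["refHOM"] = 1.0 - sum(priors.values())

post = genotype_posteriors(likelihoods, priors)
best = max(post, key=post.get)
quality = phred(1.0 - post[best])
print(best, quality)
```

With these made-up numbers the heterozygous call wins, but the strict prior keeps its quality score modest, which is the intended effect of a stringent prior on novel SNPs.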

Download

Release 1.03, 05-25-2009

CHANGE: a sub-version of SOAPsnp with gzipped file I/O. Before building, confirm that the Boost Iostreams library is installed on your platform. You can learn more about Boost at http://www.boost.org.
Download SOAPsnp ( MD5: 8d69e196013657357ff840b611762ebc )

Release 1.02, 04-10-2009

Download SOAPsnp ( MD5: d59db62ff7992cc9c8eac7303a8a3eb4 )

Release 1.01, 03-30-2009

Download SOAPsnp ( MD5: 75e6e0005c56928b5c1a2e6854f02fac )

Release 1.00, 11-13-2008

Download SOAPsnp ( MD5: b8aa0d0ad80518f9f1ef8a08d31e29de )

200 exomes

Download 200exomes

BGI and Danish researchers have sequenced 200 human exomes of European ancestry from Denmark, at an average coverage depth of 12-fold per sample, to discover new, low-frequency variants by aggregating data from all 200 individuals. With this design, intermediate between low-pass population sequencing and deep individual sequencing, we aimed to derive a high-resolution allele frequency spectrum of cSNPs with a minor allele frequency (MAF) of at least 0.02, to characterize the distribution of allele frequencies in a human population, and to use this distribution to make inferences about the effect of natural selection in the human genome. With this scale of sequencing data, we uncovered more deleterious rare variants than previously believed. Here we present the final SNP sets discovered in the 200 human exomes and list the estimated MAF for the corresponding SNPs.

200exome non-synonymous SNP annotation

Download

"NA" means that there is not enough coverage depth to call a high-confidence genotype for that individual.

"1" denotes that the non-synonymous SNP was observed in that individual.

"0" denotes that the non-synonymous SNP was NOT observed in that individual.

System requirements

SOAPsnp is a command-line program written in C/C++ that generally runs under 64-bit Linux systems. The program has been tested on various platforms, such as x86-64 Xeon with Linux kernel 2.6.9 and Loongson 2E/2F with Linux kernel 2.6.22. It is in principle portable to other architectures and systems, as only standard C++ libraries are used. The GNU Compiler Collection (version >= 3.4) is recommended for compiling the code.

The program needs about 500 MB of memory or less to run. However, its output can be very large and consume a lot of hard disk space. In text output mode, the output file may be as large as 60 times the genome size (e.g., 180 GB of free space is required for a human genome). In GLF output format (proposed by Prof. R. Durbin of the Wellcome Trust Sanger Institute), the output file requires free disk space of approximately 12 times the genome size.
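
The disk space figures above can be turned into a quick estimate before starting a run. This is a minimal sketch that simply applies the two multipliers quoted above; the function name and the 3-billion-base human genome size are assumptions made for the example.

```python
def estimate_output_bytes(genome_size_bases, output_format="text"):
    """Rough upper bound on output size, using the multipliers quoted above."""
    multipliers = {"text": 60, "glf": 12}
    return genome_size_bases * multipliers[output_format]

human_genome = 3_000_000_000  # assumed genome size of about 3 Gb

text_gb = estimate_output_bytes(human_genome, "text") / 1e9
glf_gb = estimate_output_bytes(human_genome, "glf") / 1e9
print(f"text: {text_gb:.0f} GB, GLF: {glf_gb:.0f} GB")
```

For the assumed genome size this gives 180 GB for text output, matching the figure above, and 36 GB for GLF output.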

Installation

  • Download the tarball of the latest SOAPsnp version from the link above (for example, SOAPsnp-vxxx.tar.gz).
  • In the Linux console:

    tar zxvf /<PATH_WHERE_YOU_PUT_THE_TARBALL>/SOAPsnp-vxxx.tar.gz
    cd SOAPsnp/

  • Change the ‘makefile’ if necessary. For example, you may want to modify the compiler optimization parameters.
  • In the Linux console:

    make all

    Then an executable of SOAPsnp will be generated in the directory.
    In the Linux console, type:

    ./soapsnp

    or:

    <Absolute path>/soapsnp

    to run the program. You may copy the executable to /usr/bin/ or other system paths defined in the environment variables so that you can simply run the program by directly typing “soapsnp” in the console.

Quick Start:

For diploid genome resequencing:

soapsnp -i <Alignment.soap.sort.chrN> -d <chrN.fasta> -o <chrN.consensus> -r 0.00005 -e 0.0001 -t -u -L <Maximum Read Length> -M <chrN.mat>

For monoploid genome resequencing:

soapsnp -i <Alignment.soap.sort.chrN> -d <chrN.fasta> -o <chrN.consensus> -r 0.0001 -t -u -L <Maximum Read Length> -M <chrN.mat> -m
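
When the same command has to be run for many chromosomes, it can help to assemble the argument list in a script. The sketch below only reproduces the two Quick Start command lines; the function name and the file names are hypothetical.

```python
import shlex

def build_soapsnp_command(alignment, reference, output, matrix,
                          max_read_length, monoploid=False):
    """Assemble the Quick Start argument list for one chromosome."""
    cmd = ["soapsnp", "-i", alignment, "-d", reference, "-o", output]
    if monoploid:
        # Monoploid mode: only the altHOM prior is given.
        cmd += ["-r", "0.0001"]
    else:
        # Diploid mode: altHOM and HET priors.
        cmd += ["-r", "0.00005", "-e", "0.0001"]
    cmd += ["-t", "-u", "-L", str(max_read_length), "-M", matrix]
    if monoploid:
        cmd.append("-m")
    return cmd

# Hypothetical file names for chromosome 1.
cmd = build_soapsnp_command("aln.soap.sort.chr1", "chr1.fasta",
                            "chr1.consensus", "chr1.mat", max_read_length=45)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) to launch the program.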

Usage

Command line options:

1. Required parameters:

-i <FILE> Input SORTED SOAP alignment result
Note that “sorted” here means that alignments are sorted first by chromosome name
lexicographically and then by coordinate on each chromosome numerically.

-d <FILE> Reference DNA sequence in FASTA format

-o <FILE> Output consensus file
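
The sorting requirement for the -i file can be checked before a long run. This is a minimal sketch, assuming each alignment has already been reduced to a (chromosome, coordinate) pair; extracting those two fields from a SOAP alignment line is deliberately left out, and the function name is hypothetical.

```python
def is_sorted_for_soapsnp(records):
    """Return True if (chromosome, coordinate) pairs are ordered by chromosome
    name lexicographically, then by coordinate numerically."""
    previous = None
    for chrom, pos in records:
        key = (chrom, int(pos))
        if previous is not None and key < previous:
            return False
        previous = key
    return True

# "chr10" sorts before "chr2" because the comparison is lexicographic.
ok = is_sorted_for_soapsnp([("chr1", 5), ("chr1", 40), ("chr10", 3), ("chr2", 1)])
bad = is_sorted_for_soapsnp([("chr1", 40), ("chr1", 5)])
print(ok, bad)
```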

2. Optional parameters (defaults in [ ]):

-z <Char> ASCII character that stands for quality score==0 [@]
FASTQ files generated by Illumina base-calling pipeline use ‘@’ as 0, but some institutes use ‘!’ as 0.
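
To see what the -z character means in practice, the sketch below decodes a quality string under both conventions. It is an illustration only; the function name and the sample strings are made up.

```python
def decode_qualities(quality_string, zero_char="@"):
    """Convert an ASCII quality string to integer scores,
    where zero_char is the character that encodes score 0."""
    offset = ord(zero_char)
    return [ord(c) - offset for c in quality_string]

illumina_style = decode_qualities("@Jh", zero_char="@")
other_style = decode_qualities("!+I", zero_char="!")
print(illumina_style, other_style)
```

Both calls yield the same scores, which shows why choosing the wrong -z value shifts every quality score by a constant.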

-g <Double> Global error dependency coefficient, 0.0 (completely dependent) ~ 1.0 (completely independent) [0.9]

-p <Double> PCR error dependency coefficient, 0.0 (completely dependent) ~ 1.0 (completely independent) [0.5]
	Sequencing errors are found to be slightly repeatable (once an error occurs, additional errors also tend to
	occur) for various reasons. Therefore, observations of sequencing errors are not completely
	independent. The main source of repeatable errors is believed to be PCR amplification
	in the sequencing process. The proper values of the two parameters depend on the wet-lab process.
	Nonetheless, the default values generally work well.

-r <Double> novel altHOM prior probability [0.0005]

-e <Double> novel HET prior probability [0.0010]
	The two are prior probabilities of homozygous SNPs (altHOM) and heterozygous SNPs (HET), which are 
	used in the Bayes formula calculation. Note that these are prior probabilities of a new (novel) SNP, so they are
	expected to be stringent. The two values should be changed for different species if necessary.

-t set transition/transversion ratio to 2:1 in prior probability

-s <FILE> Pre-formatted known SNP information.
	The file consists of lines like this one:
		chr1    201979756       1       1       0       0.161   0       0       0.839   rs568
	The columns from left to right are: name of chromosome, coordinate on the chromosome, whether 
	the SNP has allele frequency information (1 is true, 0 is false), whether the SNP is validated 
	by experiment (1 is true, 0 is false), whether the SNP is actually an indel (1 is true, 0 is false),
	frequency of A, frequency of C, frequency of T, frequency of G, SNP id. For known SNP sites that do
	not have allele frequency information, the frequencies can be set to arbitrary positive values, 
	which only indicate which alleles have already been deposited in the database.
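
The column layout of the known SNP file can be made concrete with a small parser. This is a sketch for checking or preparing such files, assuming the columns are separated by whitespace; the record type and field names are hypothetical and only mirror the column order given above.

```python
from collections import namedtuple

# Field order follows the column description: note the A, C, T, G order.
KnownSnp = namedtuple(
    "KnownSnp",
    "chrom pos has_freq validated is_indel freq_a freq_c freq_t freq_g snp_id",
)

def parse_known_snp(line):
    """Parse one line of the known SNP file into a KnownSnp record."""
    f = line.split()
    if len(f) != 10:
        raise ValueError(f"expected 10 columns, got {len(f)}")
    return KnownSnp(
        chrom=f[0], pos=int(f[1]),
        has_freq=f[2] == "1", validated=f[3] == "1", is_indel=f[4] == "1",
        freq_a=float(f[5]), freq_c=float(f[6]),
        freq_t=float(f[7]), freq_g=float(f[8]),
        snp_id=f[9],
    )

snp = parse_known_snp(
    "chr1    201979756       1       1       0       0.161   0       0       0.839   rs568"
)
print(snp.chrom, snp.pos, snp.freq_a, snp.freq_g, snp.snp_id)
```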

-2 Specifying this option will REFINE SNP calling using known SNP information [Off]

-a <Double> Validated HET prior, if no allele frequency known [0.1]

-b <Double> Validated altHOM prior, if no allele frequency known [0.05]

-j <Double> Unvalidated HET prior, if no allele frequency known [0.02]

-k <Double> Unvalidated altHOM rate, if no allele frequency known [0.01]
	The parameters are related to using external SNP information to alter prior probabilities for SNP calling.
	SOAPsnp will try using allele frequency information as prior probability in calling genotypes for each site.
	If the allele frequency information is absent, it will use the above 4 parameters as prior probability.

-u Enable rank sum test (which checks whether the two alleles of a possible HET call have the same sequencing quality)
	to give HET calls a further penalty for better accuracy. [Off]

-n Enable binomial probability calculation (which checks whether the two alleles are observed equally often) to give HET
	calls a further penalty for better accuracy. [Off]

-m Enable monoploid calling mode. This ensures that all consensus calls are HOM; you should probably SPECIFY 
	a higher altHOM rate. [Off]

-q Only output potential SNPs. Useful in Text output mode. [Off]

-M <FILE> Output the quality calibration matrix; the matrix can be reused with -I if you rerun the program

-I <FILE> Input previous quality calibration matrix. It cannot be used simultaneously with -M

-L <short> maximum length of read [45]
	Please note that the program will probably crash if any read is longer than this value.

-Q <short> maximum FASTQ quality score [40]

-F <int> Output format. 0: Text; 1: GLFv2; 2: GPFv2. [0]

-E <String> Extra headers EXCEPT CHROMOSOME FIELD specified in GLFv2 output. Format is 
	"TypeName1:DataName1:TypeName2:DataName2" []

-T <FILE> Only call consensus on regions specified in FILE. Format of this file is:
	ChrName\tStart\tEnd
	ChrName\tStart\tEnd
	…

-h Display this help

Output Format

1. Text format

The result of SOAPsnp has 17 columns:

1)  Chromosome ID
2)  Coordinate on chromosome, starting from 1
3)  Reference genotype
4)  Consensus genotype
5)  Quality score of consensus genotype
6)  Best base
7)  Average quality score of best base
8)  Count of uniquely mapped best base
9)  Count of all mapped best base
10) Second best base
11) Average quality score of second best base
12) Count of uniquely mapped second best base
13) Count of all mapped second best base
14) Sequencing depth of the site
15) Rank sum test p-value
16) Average copy number of nearby region
17) Whether the site is a dbSNP site
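
The 17 columns can be read into named fields for downstream filtering. The sketch below assumes the text output is tab-separated and that the quality score is an integer; the field names, the quality cutoff, and the two example rows are made up for illustration and are not real SOAPsnp output.

```python
import csv
import io

# One name per column, in the order listed above.
COLUMNS = [
    "chrom", "pos", "ref", "consensus", "quality",
    "best_base", "best_avg_qual", "best_unique_count", "best_all_count",
    "second_base", "second_avg_qual", "second_unique_count", "second_all_count",
    "depth", "rank_sum_p", "copy_number", "in_dbsnp",
]

def read_consensus(handle):
    """Yield one dict per site, keyed by the 17 column names."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(row)}")
        yield dict(zip(COLUMNS, row))

def candidate_snps(handle, min_quality=20):
    """Yield sites whose consensus differs from the reference,
    with a quality score at or above the (arbitrary) cutoff."""
    for site in read_consensus(handle):
        if site["consensus"] != site["ref"] and int(site["quality"]) >= min_quality:
            yield site

# Made-up rows for illustration only.
example = (
    "chr1\t100\tA\tA\t40\tA\t30\t12\t12\tN\t0\t0\t0\t12\t1.0\t1.0\t0\n"
    "chr1\t101\tC\tT\t35\tT\t31\t10\t10\tC\t20\t1\t1\t11\t1.0\t1.0\t0\n"
)
for site in candidate_snps(io.StringIO(example)):
    print(site["chrom"], site["pos"], site["ref"], site["consensus"])
```

For a real run, replace the io.StringIO object with an open file handle on the consensus file.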

2. GLFv2 and GPFv2

GLFv2 (Genome Likelihood Format v2) is a binary file format proposed by Prof. R. Durbin.


