Introduction

SOAPindel focuses on calling indels from next-generation paired-end sequencing data.

Requirements

SOAPindel needs two input data sources:

  1. The reference sequence file used to align the reads. It must be in Fasta format.
  2. The read alignment files. SOAPindel accepts both SOAP and SAM formats as input. When the input files are in SOAP format, users must also provide the raw read files in Fasta or Fastq format. SOAPindel can estimate the library insert sizes by itself, but providing the correct values saves some time.

Sequencing platform: In theory, SOAPindel works with all paired-end sequencing data because it does not currently take base qualities into account. It should therefore work for both Illumina GA data and 454 data, but we have only tested it on Illumina GA data.

Download

SOAPindel Release 2.1,2014-06-03 (540K)

download (MD5:317ef494173969cdc6a8244dd87d06bd)

SOAPindel Release 2.1,2013-02-20 (655K)

download (MD5:b66d6034e3dc3796b4397fe9697d636e)

SOAPindel Release 2.01,2012-09-26 (920K)

download (MD5:a4b40e9afc3c7e49fb2f61e39720455c)

SOAPindel manual 2.01,2012-09-26 (28k)

download (MD5:c66fb8674e26798a4fb450c29e734edd)

SOAPindel Release 2.0,2012-03-29 (692K)

download (MD5: d51e31b6cac5553f424efd0e3742a6ed)

SOAPindel Release 1.2,2012-03-14 (632K)

download (MD5: a0f8216d945a68980620e608cef2182e)

SOAPindel Release 1.1,2012-02-03 (230K)

download (MD5: 8fd5825bc795fb7efaccc85b60394409)

SOAPindel Release 1.01,2011-11-17 (232KB)

download (MD5: 7bd3e9ac9c1cfe168db1e90402c0576e)

SOAPindel Release 1.0,2011-11-14 (228KB)

download (MD5: 27a71b8e9b174083ece6bfa4002179ae)

SOAPindel_manual 1.0, 2011-11-24 (51.2KB)

download (MD5: cfd7d7d6cc3bfd51df8d096c8a4398bc)
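
The MD5 values listed above can be used to check a downloaded archive. The following is a minimal sketch using only the Python standard library; the function names and the archive file name in the usage note are hypothetical, not part of SOAPindel.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, published):
    """Compare a file's MD5 with a published value, ignoring case and spaces."""
    return md5_of_file(path) == published.strip().lower()
```

For example, assuming the 2014-06-03 release was saved under the hypothetical name SOAPindel_2.1.tar.gz, matches_published_md5("SOAPindel_2.1.tar.gz", "317ef494173969cdc6a8244dd87d06bd") should return True for an intact download.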

Workflow

  1. Get fuzzy positions of unmatched reads
  2. Sort fuzzy positions
  3. Find possible SNPs in all potential regions
  4. Cluster unmatched reads to regions
  5. Split reference sequence
  6. Get unmatched reads
  7. Get matched reads
  8. Do local assembly and alignment to get indels
  9. Filter results

Command & Input files

perl indel_detection.pl < mapping.list > < reference_seq(DIR|FILE) > [ < reads.list > ]

  1. < mapping.list >: list of read mapping files. Line format: < path_of_alignment_file > lib_insert_size deviation_of_insert_size reads_length type_of_file
    • < path_of_alignment_file >: preferably an absolute path
    • lib_insert_size: insert size of library
    • deviation_of_insert_size: normally set to SD*2 (~20% of library insert size)
    • reads_length: max read length in the alignment file
    • type_of_file: must be set to SINGLE, PAIR or BOTH
  2. < reference_seq(DIR|FILE) >: reference sequence folder or Fasta file.
    folder: all reference sequence files must be placed in this folder and named < chromosome_name.fa >. Each file must contain only one chromosome, be in Fasta format, and have the id “>chromosome_name” (case sensitive).
    fasta file: all sequences must be placed in one Fasta file, each with the id “>chromosome_name”.
  3. < reads.list >: list of read files. Only needed for SOAP format input; put the absolute path of one file on each line.
    1. If paired-end reads are stored in separate files, the read ids of a pair may be identical, but the file names must contain _a, _b or _1, _2 so that SOAPindel can tell the order of the reads.
    2. If paired-end reads are stored in one file, the read names must have the form reads_id/1 and reads_id/2.
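
As an illustration of the mapping.list line format described above, here is a small sketch that formats one entry. The helper name, the example path, and the single-space field separator are assumptions; the default deviation follows the "~20% of library insert size" rule of thumb from the list.

```python
def mapping_list_line(alignment_path, insert_size, read_length,
                      file_type="PAIR", deviation=None):
    """Format one mapping.list entry from the five documented fields."""
    if file_type not in ("SINGLE", "PAIR", "BOTH"):
        raise ValueError("type_of_file must be SINGLE, PAIR or BOTH")
    if deviation is None:
        # Rule of thumb from the text: SD*2, roughly 20% of the insert size.
        deviation = round(insert_size * 0.2)
    fields = [alignment_path, insert_size, deviation, read_length, file_type]
    # Assumption: fields are separated by single spaces.
    return " ".join(str(field) for field in fields)


# Hypothetical library: 500 bp insert size, 100 bp reads, paired alignments.
print(mapping_list_line("/data/lib500.soap", 500, 100))
# prints: /data/lib500.soap 500 100 100 PAIR
```

One such line per alignment file can then be written to mapping.list.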

The command can be run directly, but for large data sets (such as one or more chromosomes), users are advised to use “-p” to print the script and run it manually (see the Parameters section for details).
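
The command line above can also be assembled programmatically. This is only a sketch: the helper name and the file names are hypothetical, and placing the options after the positional arguments is an assumption that should be checked against the user manual.

```python
import shlex


def build_command(mapping_list, reference, reads_list=None, options=()):
    """Assemble the indel_detection.pl call as an argument list."""
    argv = ["perl", "indel_detection.pl", mapping_list, reference]
    if reads_list is not None:
        # reads.list is only needed for SOAP format alignments.
        argv.append(reads_list)
    # Assumption: options follow the positional arguments.
    argv.extend(options)
    return argv


# "-p" only prints the script instead of running it; "-wd" sets the work dir.
argv = build_command("mapping.list", "ref/", "reads.list", ["-p", "-wd", "work"])
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run(argv) once the option placement has been confirmed.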

Parameters

For detailed information, see the user manual.

  • String & Number Parameters:

    -chr chromosomes (ALL)

    -l layer_num (2)

    -k kmer size (25)

    -cpu cpu number (7)

    -ext extension length for every cluster (50)

    -w window size for successive homogeneous indels checking (30)

    -n max successive homogeneous indels #/window (2)

    -x_num read_max_hits# (1)

    -mm max_mismatch (int(read_len/25)+1)

    -il max gap between nails (100)

    -ol overlap length when cut cluster by max length (il)

    -xl max cluster length (ol*3)

    -fmt mapping format (SOAP) [SOAP|SAM]

    -sdp use bigger deviation (deviation*sdp) to filter insert size of pair aligned reads (2.0)

    -wd work_dir (.)

    -cm path_of_cross_match, only works when -ucm is set ()

    -st path of samtools, only needed for .bam files

    -mc max contigs to align (100)

    -mx unique reads max reliable coverage (auto)

    -aa 0|1|2 (0)

  • Switch Parameters:

    -p only print script, don't run

    -qseq print sample seq for PCR

    -qsub use qsub to submit jobs

    -pp print progress in log (debug)

    -no_fs don't filter reads with SNPs (debug)

    -ucm use cross_match to do local alignment (debug)

    -t test time and memory (debug)
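
Several defaults in the String & Number Parameters list are derived from other values: -mm from the read length, -ol from -il, and -xl from -ol. The sketch below only reproduces that arithmetic as written in the list; the function name is hypothetical and the code is not taken from SOAPindel itself.

```python
def derived_defaults(read_len, mm=None, il=100, ol=None, xl=None):
    """Fill in the derived defaults for -mm, -ol and -xl."""
    if mm is None:
        mm = int(read_len / 25) + 1  # -mm: int(read_len/25)+1
    if ol is None:
        ol = il                      # -ol defaults to the value of -il
    if xl is None:
        xl = ol * 3                  # -xl defaults to ol*3
    return {"mm": mm, "il": il, "ol": ol, "xl": xl}


print(derived_defaults(100))
# prints: {'mm': 5, 'il': 100, 'ol': 100, 'xl': 300}
```

With 100 bp reads and no overrides, this gives a maximum of 5 mismatches, a 100 bp overlap length, and a maximum cluster length of 300.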

Output Files

  • Log
    • SOAPindel stores running logs in the WORK_DIR/log folder. Users can run “cat WORK_DIR/log/*.log” to check whether anything went wrong during the process. If the “-pp” option is switched on, the logs also show the progress as a percentage.

  • Result
    • SOAPindel stores results in WORK_DIR/result/chr*, one folder per chromosome. There are 8 files for each chromosome:
      • chr*_2L.mutation.raw
      • chr*_2L.mutation.raw2
      • chr*_2L.mutation.sorted
      • chr*_2L.mutation_sfo1.list
      • chr*_2L.heterozygous.list
      • chr*_2L.indel.list
      • chr*.HEAD
      • chr*_2L.indel.VCF
    • All filtered mutations (including SNPs) are listed in the file: WORK_DIR/result/chr*/chr*_2L.mutation_sfo1.list.
    • The indels are listed in the file: WORK_DIR/result/chr*/chr*_2L.indel.list.
    • chr*_2L.indel.VCF is the VCF v4.0 format file. chr*.HEAD is the header of the VCF file.
    • All these files have the same format. Each column is described below:
      1. cluster regions id
      2. assembled contigs id
      3. chromosome
      4. type[-size] (S:SNP; D:Deletion; I:Insertion; N:Heterozygote Position)
      5. start position on reference chromosome
      6. local start position on the cluster.
      7. indel range (local start - local end: the indel could occur at any position between the start and end, so local end - local start + size = tandem repeat length, and tandem repeat length / size = repeat times)
      8. indel allele
      9. average coverage for insertion; minimum coverage for deletion
      10. minimum coverage for insertion; overlapping coverage for deletion
      11. “+” for homozygote; sample genotype / reference genotype ratio for heterozygote
      12. L(left flank length from the assembled contig start to indel start)_R(right flank length from indel end to the contig end)
      13. 10 bp left flank on reference_10 bp right flank on reference
      14. Empty by default; if -qseq is set, this column is (left flank of the contig)_(right flank of the contig) without the allele sequence.
      15. used internally
      16. successive homogeneous indels #/window size
      17. used internally
    • There is also one statistics file: WORK_DIR/result/chrALL.stat.tab.
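
The relations given for columns 4 and 7 above can be sketched as follows. The function names are hypothetical, and the exact text layout of the type field (for example "D3" versus "D-3") is an assumption, so the parser accepts both; the repeat arithmetic is taken directly from the column 7 description.

```python
import re

# Type codes from the column 4 description.
TYPE_NAMES = {"S": "SNP", "D": "Deletion", "I": "Insertion",
              "N": "Heterozygote Position"}


def parse_type(field):
    """Split a type[-size] field into (type name, size or None)."""
    # Assumption: one type letter, then an optional size with optional "-".
    match = re.fullmatch(r"([SDIN])-?(\d+)?", field)
    if match is None:
        raise ValueError("unrecognised type field: " + field)
    size = int(match.group(2)) if match.group(2) else None
    return TYPE_NAMES[match.group(1)], size


def tandem_repeat(local_start, local_end, size):
    """Return (tandem repeat length, repeat times) as defined for column 7."""
    if size <= 0:
        raise ValueError("size must be positive")
    length = local_end - local_start + size
    return length, length / size


print(tandem_repeat(10, 16, 2))
# prints: (8, 4.0)
```

In this made-up example, a 2 bp indel whose local range runs from 10 to 16 lies in an 8 bp tandem repeat, which corresponds to 4 repeat units.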