Recommendations & Troubleshooting

Recommendations

Minimum read depth filter -c

Accurate haplotype frequency estimation requires a minimum read count, which differs between sample types (individuals and Pool-Seq) and ploidy levels.

For diploid individuals, the probability of seeing both alleles at least once (the alleles are the same if the locus is homozygous and different if it is heterozygous) equals 1 minus the probability of seeing only one allele:
../_images/formula_diploid_11.png
where c is the read count. This is visualized in the image below as the green line; the black lines mark a 95% chance (6 reads) and a 99% chance (8 reads).
However, because sequencing errors are prevalent, it is advisable to see each allele at least twice, which is represented by the blue line. The formula for this curve extends the one used for a single sighting: in addition to the cases already excluded, all combinations in which an allele is seen only once are removed.
../_images/formula_diploid_21.png
For two sightings per allele, the 95% boundary is 9 reads and the 99% boundary is 12 reads.
For more than two sightings per allele, one can use the general formula:
../_images/formula_diploid_31.png
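
The probabilities described above can be checked numerically. The sketch below is a reconstruction from the text, not the formula shown in the images: it assumes that every read samples one of the two alleles independently with probability 1/2 and that sequencing errors are ignored. The function names ``prob_each_allele_seen`` and ``min_reads`` are hypothetical and are not part of SMAP.

```python
from math import comb


def prob_each_allele_seen(c: int, k: int = 1) -> float:
    """Probability that both alleles are seen at least k times in c reads.

    Assumption: each read shows either allele with probability 1/2.
    """
    if c < 2 * k:
        # Fewer than 2k reads can never show both alleles k times each.
        return 0.0
    # Probability that one given allele is seen fewer than k times.
    tail = sum(comb(c, i) for i in range(k)) / 2 ** c
    # With c >= 2k the two "seen too rarely" events cannot occur together,
    # so their probabilities simply add up.
    return 1.0 - 2.0 * tail


def min_reads(target: float, k: int = 1) -> int:
    """Smallest read count c that reaches the target probability."""
    c = 2 * k
    while prob_each_allele_seen(c, k) < target:
        c += 1
    return c


print(min_reads(0.95, k=1), min_reads(0.99, k=1))  # 6 8
print(min_reads(0.95, k=2), min_reads(0.99, k=2))  # 9 12
```

Under these assumptions the sketch reproduces the read counts quoted in the text (6 and 8 reads for one sighting, 9 and 12 reads for two sightings), and ``k`` can be raised to explore stricter requirements.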

# .. image:: ../images/sites/SMAP_haplotype_diploid_ind_read_count_requirement.png

Therefore, the user is advised to set the read count threshold so that the reported haplotype frequencies per locus are based on sufficient read data. If the total haplotype count of a locus in a sample is below the user-defined minimum read count threshold (option -c; default 0; recommended values are 10 for diploid individuals, 20 for tetraploid individuals, and 30 for pools), then all haplotype observations at that locus are removed for that sample.
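
The effect of the threshold can be illustrated with a minimal sketch. It is not SMAP's implementation: the nested dictionary layout, the function name ``apply_min_read_count`` and the example counts are all hypothetical and serve only to show the rule described above.

```python
def apply_min_read_count(counts, min_count):
    """Drop all haplotype observations of a sample at a locus whose total
    read count is below min_count.

    counts: {locus: {sample: {haplotype: read_count}}} (hypothetical layout)
    """
    filtered = {}
    for locus, samples in counts.items():
        filtered[locus] = {}
        for sample, haplotypes in samples.items():
            total = sum(haplotypes.values())
            # Keep the observations only if the locus has enough reads.
            filtered[locus][sample] = dict(haplotypes) if total >= min_count else {}
    return filtered


# Hypothetical counts for one locus in two diploid individuals.
example = {
    "locus_1": {
        "sample_A": {"00": 7, "01": 6},  # 13 reads in total: kept
        "sample_B": {"00": 3, "01": 2},  # 5 reads in total: removed
    }
}
print(apply_min_read_count(example, min_count=10))
```

With a threshold of 10, ``sample_A`` keeps both haplotype counts, whereas ``sample_B`` ends up with no observations at this locus.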


Troubleshooting

FASTQ Sequence identifier format

SMAP haplotype-window does not support old Illumina sequence IDs like the example entry shown below, because the # blocks the cluster coordinates in the read ID. To solve this, it suffices to replace the # with a single space, for example by running sed -i 's|#| |g' *.fq on the FASTQ files.
Also note that the quality encoding in the example below uses the old Phred+64 format; this does not present any issue.
@ILLUMINA-52179E_0009:8:1:1057:18188#CAGATC/1
ATCGCGGGCAACGGCAGCGCCAGNTAGGGCGGCGCCGGCTACGTTTCCTG
+ILLUMINA-52179E_0009:8:1:1057:18188#CAGATC/1
dcddddcZ`^Lb^bbccddTb^cBTLTbSPL_F_]Y`b_YL]\ILK_\[Z
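
The sed command mentioned above replaces # on every line of the file. In Phred+33-encoded files, # is also a valid quality character, so it can be safer to touch only the identifier lines. The sketch below shows one way to do that; it assumes strict four-line FASTQ records (no wrapped sequences), and the function name ``replace_hash_in_headers`` and the example record are hypothetical.

```python
from typing import Iterable, Iterator


def replace_hash_in_headers(lines: Iterable[str]) -> Iterator[str]:
    """Replace '#' with a space on the '@' and '+' lines of each record.

    Assumption: every record spans exactly four lines.
    """
    for index, line in enumerate(lines):
        if index % 4 in (0, 2):
            # Line 1 (identifier) and line 3 (optional repeat of the identifier).
            yield line.replace("#", " ")
        else:
            # Sequence and quality lines are passed through unchanged.
            yield line


# Hypothetical record with a '#' in both the identifier and the quality line.
record = [
    "@ILLUMINA-52179E_0009:8:1:1057:18188#CAGATC/1",
    "ATCGCGGG",
    "+ILLUMINA-52179E_0009:8:1:1057:18188#CAGATC/1",
    "IIII#III",
]
for fixed_line in replace_hash_in_headers(record):
    print(fixed_line)
```

To process a file, the same generator can be fed the lines of an opened FASTQ file and its output written to a new file, which leaves the original data untouched.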

Output

When opening the tab-delimited .tsv output files in Microsoft Excel, one might encounter the error that certain rows contain 1 column and others 2 columns, which makes it impossible to use the built-in option Data -> Text to Columns.
To circumvent this issue, open a new Excel file and import the data with the option Data -> Get Data -> From Text/CSV.