Recommendations & Troubleshooting

Recommendations

In general, for each option a recommendation is given. Please carefully follow the specific instructions per data type (HiPlex, Shotgun, or GBS, individuals or Pool-Seq, diploid or tetraploid).
Because haplotyping in GBS and Shotgun SVs takes SMAPs into account as molecular markes, and HiPlex data and Shotgun sliding frames do not, it is mandatory to specify this; use -partial include for GBS and Shotgun SVs and -partial exclude for HiPlex and Shotgun sliding frames.
Pool-Seq data may be analysed with different thresholds for minimum haplotype frequency, and optimize -f values to retain low-frequency haplotypes (range 1-5%) while suppressing noise.
To optimize discrete haplotype calls in individuals, it can be useful to first conduct a test run with default values for option --frequency_interval_bounds, and inspect the haplotype frequency plots and then define custom thresholds if required.

Why you should not run SMAP delineate on HiPlex data to identify SMAPs

Haplotyping requires the definition of a start and end point to bundle polymorphic sites. For HiPlex data, these SMAP sites are simply defined by the primer binding positions, using a Python script provided under Utilities (see further …). In GBS data and in some forms of Shotgun data (see Structural Variants), SMAPs define biologically informative sites that can be recognised (hence genotyped) based on the consistent mapping patterns at individual read level. By comparison to Shotgun and GBS data, preprocessing of HiPlex amplicon sequencing reads may be more difficult, as technical issues hamper accurate positional and pattern trimming. First, in highly multiplex amplicon sequencing, each amplicon contains a unique pair of primers. Trimming these primers off by pattern matching requires searching for all possible combinations of primers in each read, which may become computationally prohibitive. Second, as each primer typically has a slightly different length (range 18-27 bp), trimming a fixed length off the 5’-end and 3’-end of each amplicon does not yield genomic fragments with a sequence exactly internal to the primers. Third, as highly multiplex amplicon sequencing reads do not always start at the first (5’ end) nucleotide of the primer, positional trimming of reads by any fixed length would create a similar distribution of variable sequence start points, and thus artefactual Stack Mapping Anchor Points (SMAPs).

Minimum read depth filter -c

Accurate haplotype frequency estimation requires a minimum read count, which depends on sample type (individuals and Pool-Seq) and ploidy levels.

For diploid individuals the odds of seeing both alleles at least once (which are the same if homozygous and different if heterozygous) is equal to 1 minus the odds of only seeing one allele.
../_images/formula_diploid_1.png
with c the read count. This is shown in the graph below as the green line, the black lines represent a 95% chance (6 reads) and a 99% chance (8 reads).
However due to the prevalence of sequencing errors it is advisable to detect each allele at least twice, represented by the blue line. The formula for this curve is an extension of the one used for 1 observation, and in addition all combinations wherein an allele is seen only once are removed.
../_images/formula_diploid_2.png
For two observations per allele, the 95% boundary is 9 reads and the 99% boundary is 12 reads.
For >2 observations per allele this function applies:
../_images/formula_diploid_3.png ../_images/SMAP_haplotype_diploid_ind_read_count_requirement.png

Therefore, the user is advised to use the read count threshold to ensure that the reported haplotype frequencies per locus are indeed based on sufficient read data. Per sample, all haplotype observations are removed for loci with a total haplotype count below the user-defined minimal read count threshold (option -c; default 0, recommended 10 for diploid individuals, 20 for tetraploid individuals, and 30 for pools).

Frequency interval bounds and dosage filter for individuals

discrete dosage calls for diploids (0/1/2)

Use this option if you want to customize discrete calling thresholds. Haplotype calls with frequency below the lowerbound percentage are considered “not detected” and receive dosage `0´. Haplotype calls with a frequency between the lowerbound and the next percentage are considered heterozygous and receive haplotype dosage `1´. Haplotype calls with frequency above the upperbound percentage are considered homozygous and scored as haplotype dosage `2´. default <10, [10:90], >90 . Should be written with spaces between percentages, percentages may be written as floats or as integers [10 10 90 90].

e.g. --discrete_calls dosage --frequency_interval_bounds 10 10 90 90 translates to: haplotype frequency < 10% = 0, haplotype frequency > 10% & < 90% = 1, haplotype frequency > 90% = 2.

Visualized examples of these thresholds can be found in these tabs.

Why use dosage filter to remove low quality genotype calls (-z)?

The dosage filter -z is an additional mask specifically for dosage calls in individuals. It masks loci within samples from the dataset (replaced by -u or --undefined_representation) based on total dosage calls (= total allele count calculated from haplotype frequencies using frequency interval bounds). It is important to distinguish between total dosage call and total number of unique alleles per locus per sample. A tetraploid individual for example is expected to contain a total dosage call of 4 alleles, but can contain from 1 up to 4 unique (different) alleles:

locus

dosage

total dosage call

number of unique alleles

.

a

b

c

d

.

.

aaaa

4

0

0

0

4

1

aaab

3

1

0

0

4

2

aabb

2

2

0

0

4

2

abcc

1

1

2

0

4

3

abcd

1

1

1

1

4

4

The dosage filter -z evaluates the total dosage call against the expected number of alleles (2 in diploids, 4 in tetraploids), but does not consider the number of unique alleles. In general, the expected total dosage call for any locus is equal to the ploidy of the individual (except in exceptional cases such as aneuploidy).

Consider the examples of a single locus in a few samples in the tabs below for illustration of the combined functions of -f (minimum haplotype frequency), --frequency_interval_bounds and -z (dosage_mask).

../_images/dosage_filter_2n.png

The dosage filter is applied last (after all other filters). An adequate value for the filter -f (minimum haplotype frequency) is especially useful to reduce the number of masked calls across the sample set.
For example, in Sample2 in the diploid example above, a haplotype (c) occured at 4.7% of the locus read depth. Removing this haplotype using the option -f, also means removing the associated read counts from the read count table and recalculating the relative frequencies of the other haplotype based on the remaining read counts.
Recalculated frequencies of haplotype a would be 92.5% and haplotype d 7.5%. Then, discrete calling would lead to dosage calls of 2 for haplotype a and 0 for haplotype b, adding to a total dosage call of 2 (haplotype aa, homozygous individual).
Additionally, the --frequency_interval_bounds can be adjusted by the user, based on the haplotype frequency distribution plot. Lowerbound and upperbound thresholds should be set to remove the extremes in the haplotype frequency graphs, to reduce noise before discrete genotype calling and masking.

Troubleshooting

Output

When opening the output (Tab delimited) .tsv files in Microsoft Excel, one might encounter the error that certain rows contain 1 column and others 2 columns, making it impossible to use the built-in option Data -> Text to Columns. In order to circumvent this issue, it is best to open a new Excel-file and use the option Data -> Get Data -> From Text/CSV.


FAQ