HiPlex amplicon sequencing

Setting the stage

In HiPlex sequencing, haplotypes are defined by a set of SNPs between two fixed Anchor points (start and end positions of the locus). The Anchor points are typically defined as the nucleotide positions immediately interior to the primers used for HiPlex library preparation.

Defining the start and end point for haplotyping in HiPlex loci

The length of amplicons generated by HiPlex libraries is typically short (100-200 bp), so that paired-end reads (2x150 bp) overlap in the middle of the fragment and should be merged during preprocessing (e.g. with PEAR) before read mapping with BWA-MEM. As a consequence, all merged reads span the reference genome region between the two primers used for amplification, and the locus used for haplotyping is defined as the region flanked by the primers, but excluding the primer binding regions themselves. Here, we explain why it is important to keep the primer sequences on the ends of reads before read mapping, but not to include those primer binding regions for haplotyping.

primer sequences perfectly match the reference and support read alignment towards the ends

Because primer sequences are typically designed on the reference sequence used for read mapping, they can help create a good alignment to the reference sequence by adding stretches of perfect match towards the ends of the read. Therefore, to keep the longest possible sequence context for read alignment, we recommend to retain the primer sequence in HiPlex reads before read mapping to the reference with BWA-MEM, and not to trim primers off during read preprocessing.

primer sequences perfectly match the reference and can not report sample-reference polymorphisms

Primers are synthesized prior to the PCR, and are brought into the PCR reaction to become incorporated in the PCR-amplicon molecule. Hence, by definition, the region of the read that corresponds to the primer sequence can not report SNPs between the genotype under study and the reference sequence. Consequently, primer binding sites themselves should be excluded from the haplotyped region. While in principle the position of Anchor points can be fully customized, they are typically positioned immediately interior to the primer binding sites, which are a priori known by design of the primers. A python script that transforms pairs of primer binding sites into the BED file with SMAPs required for SMAP haplotype-sites is provided in the Utilities.

Recognizing haplotypes

The tabs below show the same locus/amplicon in 3 diploid individuals. A total of 4 SNPs are found among the individuals; 4 haplotypes can clearly be defined (1 is the same as the reference sequence and 3 alternative haplotypes). In practice, likely due to amplification, ligation, sequencing and/or quality trimming artefacts, HiPlex reads often do not perfectly align to the start of the primer sequences. In this sense, HiPlex read mappings often show `fuzzy´ Stacks. But as long as reads cover the locus start Anchor point (just downstream of the forward primer), and the locus end Anchor point (just upstream of the reverse primer), then these reads are taken into account and SNPs in the interior sequence of the locus can all be haplotyped by read-backed haplotyping.

The reconstructed haplotype codes are shown on the right. Light blue arrows and characters indicate locus start and end Anchor points, red arrows and characters indicate SNPs. More information on haplotype calling is found below the tabs.

Although in this figure a single base deletion can be seen on position 58, it is not called as it does not coincide with a SNP or Anchor position.

Step 1: Combining SNPs between Anchor points into sets of known polymorphic sites

procedure

For HiPlex data, the user should create a custom BED file which contains the Anchor points to delineate loci for haplotyping (see instructions here). Reads not spanning these Anchor points, as well as SNPs positioned outside these regions are not taken into account for haplotype calling.

../../../_images/SMAP_haplotype_step1_AS.png

SMAP haplotype-sites uses bedtools intersect to select sets of SNPs from the VCF file that are located inbetween the Anchor points per locus.

Then, a set of putatively polymorphic sites is made per locus (amplicon) by concatenating the locus start Anchor point, all internal SNPs, and the locus end Anchor point, all ordered by position.

Non-polymorphic loci are not analyzed

HiPlex loci defined in the BED file that do not contain overlapping SNPs are uninformative and are excluded from further analysis. These loci are also not reported in the output.

Step 2: Calling and counting haplotypes

Read-backed haplotyping is used to reconstruct haplotypes.

../../../_images/SMAP_haplotype_step2_AS.png

procedure

The following procedure is performed per sample:

For each locus listed in the BED file, all overlapping reads are retrieved from a BAM file.
For each read, a haplotype string is created as follows:
For each putatively polymorphic site, but including the outer Anchor points, the corresponding position in the aligned read is checked and reference/alternative/absence/gap is scored:

CALL TYPE

CLASSES

.

absence of read mapping (indicating partially aligned read)

0

presence of the reference nucleotide

1

presence of an alternative nucleotide (any nucleotide different from the reference)

-

presence of a gap in the alignment

Deletions (-) are only scored when they overlap with SNP positions, these are not considered as polymorphic positions. Likewise, insertions are not considered polymorphic positions, and moreover these are not called. In HiPlex-data, InDels do not modify the start and end points of read mapping (SMAPs) like in GBS data, as loci are defined by primer sequence ends.

The concatenated string of `.01-´ scores then defines the haplotype per read.

In principle, this procedure allows to detect any linear combination of SNPs (i.e. haplotype) per locus, but only experimentally observed haplotypes are recorded.

SMAP haplotype-sites reconstructs this haplotype string for each read, and counts the number of reads per haplotype per sample. Next, the sum of all read counts per haplotype per locus is calculated (total locus read count per sample), filtered for minimal read count per locus (option -c), and all information is stored in a table per sample.

../../../_images/SMAP_haplotype_step3_AS.png

filters

loci with low read count are removed from the dataset with a read count threshold (option -c)

Accurate haplotype frequency estimation requires a minimum read count which is different between sample type (individuals and Pool-Seq) and ploidy levels.

The user is advised to use the read count threshold to ensure that the reported haplotype frequencies per locus are indeed based on sufficient read data. If a locus has a total haplotype count below the user-defined minimal read count threshold (option -c; default 0, recommended 10 for diploid individuals, 20 for tetraploid individuals, and 30 for pools) then all haplotype observations are removed for that sample. For more information, see recommendations on minimal read depth.

Haplotypes with gaps on one of the polymorphic sites can be removed from the dataset (option --no_indels)

In some cases, gaps in the alignment (putatively caused by InDels) may overlap with SNP sites in individual reads. The option --no_indels filters out any haplotypes that contain `-´ characters in their haplotype string, and recalculates the total read count per locus.

Using the option -mapping_orientation ignore for HiPlex data

For HiPlex data, use the option -mapping_orientation ignore to collect all reads per locus independent of the strand that the reads are mapped on (i.e. ignoring their mapping orientation). Depending on HiPlex library construction methods, the amplicons may be sequenced directionally (by incorporating the Illumina sequencing adapters during (nested) PCR-amplification, leading to directional read mapping onto the reference sequence), or non-directionally (by ligation of Illumina adapters after a first round of HiPlex PCR-amplification, leading to equal proportions of reads mapped in forward and reverse orientation onto the reference sequence). In any case, all reads amplified with the respective primer pair will span the entire region of the locus, and mapping orientation may be ignored when collecting the reads that overlap with the locus.

Using the option -partial exclude for HiPlex data

Haplotypes are extracted from HiPlex, Shotgun, and GBS data using the same principles of read-backed haplotyping. This means that the same algorithm for haplotype calling and haplotype frequency counting can be used, but with specific options and BED files per library preparation method. Because mapped HiPlex reads are expected to span the entire length of the locus, we recommend to remove reads that are partially aligned by using the option -partial exclude.

The special case of overlapping amplicons is resolved by using option -partial exclude for HiPlex data

Some complementary HiPlex amplicon sets may contain partially overlapping amplicons. These are generated in independent PCR reactions, but all merged reads can be mapped together in a single BAM file per sample, and analyzed with a BED file that contains all loci. The scheme below shows how using option -partial exclude makes sure that reads are only evaluated and counted for loci with complete coverage (read mapping at both start and end Anchor points per locus).

../../../_images/A_Hiplex_PE_merged_option_no_partial.png

SMAP haplotype-sites evaluates all reads that overlap with at least a single nucleotide at a given locus. For HiPlex data analysis, where option -partial exclude should be used, SMAP haplotype-sites first evaluates if a read spans the entire length of the locus to which it is mapped. See scheme above. For locus 1, only the black reads that align entirely to the first locus are scored for SNPs 1, 2, and 3, while grey reads derived from locus 2, and with only partial alignment to locus 1, are ignored. Conversely, only the grey reads that align entirely to the second locus are scored for SNPs 3, 4, and 5, while black reads derived from locus 1 are ignored. The option -partial exclude thus evaluates each read only once, does not inflate the total read count per amplicon, and does not attempt to extend the sets of neighboring SNP sites beyond the reach of reads.