Feature Description

Definition of SMAPs, Stacks, StackClusters, and MergedClusters

Intuitively, GBS may be expected to create perfectly aligned `Stacks´ of reads after reference mapping. However, due to allelic sequence diversity, in reality the start and end points of mapped GBS reads often do not align at a given genomic locus, thus creating `fuzzy´ Stacks.
To emphasize the occurence of consistency in the variation of read mapping, SMAP delineate introduces three novel concepts in addition to Stacks, namely: SMAPs, StackClusters and MergedClusters.

SMAP delineate first creates Stacks by identifying sets of reads with identical read mapping start and end positions per sample. The start and end positions of such Stacks are called Stack Mapping Anchor Points (SMAPs), so that Stacks capture read mapping consistency. Stacks are then incrementally overlapped, so that StackClusters capture within-sample read mapping variation, and MergedClusters capture between-sample read mapping variation. See schemes below for graphical illustration of the concepts.

Schematic overview of SMAPs, Stacks, StackClusters, and MergedClusters.

../_images/SMAP_delineate_cut.png

Why polymorphisms (SNPs and InDels) give rise to variation in mapping positions of reads

Above, we illustrate the common observation that sequence polymorphisms affect the final read mapping positions.
So, paradoxically, while GBS is used to detect polymorphisms by genome re-sequencing of restriction enzyme site flanking loci, polymorphisms themselves affect each of the three main steps in the GBS procedure:
  1. GBS library construction (restriction enzyme site ligated adapters)

  2. sequencing (Illumina short reads with fixed length)

  3. mapping (BWA-MEM; Smith-Waterman seed-extension alignment)

Next, we explain the underlying reasons for each step.

Why Stacks exist in GBS data

Stacks are defined as sets of short reads with identical read mapping start and end positions (SMAPs)

../_images/Separately_mapped_vs_merged.png
Single-end GBS reads are derived from individual molecules (restriction fragments). If Illumina single-end read length is shorter than the PCR-amplified GBS fragment length (distance between two neighboring restriction enzyme sites (RE)), then that individual molecule is only partially sequenced (black arrows).
Paired-end GBS reads are derived from two sides of an individual molecule. If NGS read length is longer than half of the PCR-amplified GBS fragment length (distance between two neighboring REs), then those reads overlap at least partially in the middle of the GBS fragment. In this case, both reads cover a common sequence of the same molecule and the reads should be merged (by e.g. PEAR) in order to improve the base calling quality in the middle of the read, and to reduce redundancy of read depth (i.e. redundancy in reference genome coverage), and to create a long sequence spanning the entire GBS fragment from RE to RE.

Type of reads mapped

Start position

End position

separate reads (+ strand)

upstream restriction enzyme site (RE)

Start position + fixed read length

separate reads (- strand)

End position - fixed read length

downstream restriction enzyme site (RE)

merged reads

upstream restriction enzyme site (RE)

downstream restriction enzyme site (RE)

In practice, polymorphisms (SNPs and InDels) affect each of the three steps of GBS (restriction enzyme sites, effective alignable read length, and mapping), thus locally shifting the start and end positions of the final read mapping on the reference genome sequence.
This, in turn, affects which reference positions are effectively covered by a set of alternative alleles at the borders of a given locus (See Sneak preview: SNP calling artefacts above). We first describe the general principle based on single-end sequencing. Then we further describe how this principle affects paired-end merged sequences.

Polymorphisms at restriction enzyme sites affect GBS library construction

Polymorphisms affect the genomic positions at which adapters may be ligated

Restriction enzyme sites (RE) are positions where GBS-adapters are ligated, and mark the beginning (5’ end) of a read sequence. Polymorphisms (both SNPs and InDels) occuring at the restriction enzyme site may lead to loss or gain of REs in the genome of the sample under study, thus affecting the positions where adapters are ligated. The relative distance between two neighboring RE’s is important because only fragments in a narrow size range (typically 100-300 bp) are size-selected and PCR-amplified before sequencing. Depending on the GBS-protocol, size-selection may be performed through band excision after gel-electrophoresis and/or using restrictive elongation times during PCR-amplification. Thus, polymorphisms at REs lead to absence/presence of entire GBS fragments (NULL alleles), or may locally shift the start position of a read to a neighboring RE. The proportion of non-overlapping GBS loci in the sample set is proportional to the density of SNPs and InDels in the genome; species with higher genetic diversity contain less common GBS loci across sample sets.


Polymorphisms affect the effective sequenced region

InDels affect effective range covered in the reference sequence by reads with fixed read length

InDels affect which part of the reference sequence is effectively covered by a short read, “anchored” by a restriction enzyme site and of fixed length.

Deletion

../_images/deletion_scheme.png

An alternative allele with a deletion compared to the reference will not have to spend sequence length on the deleted region, thus allowing to sequence farther away from the RE.

Insertion

../_images/insertion_scheme.png

An alternative allele with an insertion compared to the reference will have to spend sequence length on that insertion, thus shortening the distance that can be sequenced away from the RE.

SNPs

As SNPs are nucleotide substitutions, they do not change the effective distance sequenced away from the RE.


Polymorphisms affect read mapping

Mismatches between read and reference affect the alignment itself, and thus the region of the reference that is covered after read mapping

The BWA-MEM algorithm works by seeding alignments with maximal exact matches (MEMs) and then extending seeds with the affine-gap Smith-Waterman algorithm (SW). Since sequence reads derived from a given allele at a given locus are identical (except from read errors), the BWA-MEM algorithm generates the same seed and performs the same alignment extension, thus creating exactly the same mapping for all reads derived from the same allele, leading to stacked read alignments per allele. Polymorphisms may affect the MEM - and thus the initial seed sequence - or stop the extension towards the respective ends of the read if SNPs or InDels interrupt further SW sequence alignment. Notwithstanding, the BWA-MEM alignment algorithm will produce the same mapping for all reads derived from a given allele, with alternative start and end positions compared to reference reads depending on the local distribution of SNPs and InDels.

Polymorphisms in the middle of a read

Typically, SNPs or InDels in the middle of the read do not strongly affect the start and end positions of the alignment, as long as minimal read-reference sequence similarity is maintained to support alignment extension outwards from the MEM.

Polymorphisms towards the ends of a read

Typically, SNPs and InDels closer to the respective ends of the read will result in soft clipping: the premature truncation of the alignment extension. Close to the end of a read, InDels may generate a too high gap penalty score, and high density of SNPs may generate a too high cumulative mismatch penalty, to be compensated for by positive scores of matching alignment after the gap or stretch of SNPs, thus leading to truncation of the alignment extension just prior to the start of the polymorphic region (see below).


SMAPs in separately mapped reads versus merged reads

In the two tabs below, we illustrate in detail how different types of polymorphisms occuring at various locations within a given locus affect the final read mapping positions. The effect on read mapping is different for separately mapped reads (obtained by single-end or paired-end sequencing), and for paired-end reads that are merged before read mapping.

These tabs display schematic overviews of the different reasons why polymorphisms (SNPs and Indels) give rise to alternative mapping positions of reads, compared to a reference read obtained by GBS and mapped as separate reads. We show the effects according to the three main steps in the GBS procedure:

  1. library construction: (gain of RE, loss of RE)

  2. short, fixed read length sequencing: (insertions and deletions)

  3. mapping: (soft clipping)

../_images/SMAP_scenario%27s_seq_align_A.png
Allele A: a gain of restriction enzyme site in the middle of the fragment.
This creates two new PCR-fragments that may both be sequenced (only one is indicated in the scheme).

If only one fragment is PCR-amplifiable, sequenced, and mapped, this creates a novel 3’ read mapping end and no problem occurs. Note that the novel restriction enzyme site (shown as shaded sequence (CTGCAG) on the 3’ of the read) is removed during read preprocessing. The sequence is indicated here to show how a single SNP can create a novel RE.

If both fragments are in the selected size range, but they map next to each other without overlap (image below) (the restriction enzyme site remnant is removed during read preprocessing), the reads are analyzed and counted separately, despite originating from the same chromosome molecule. This inflates the total read counts per locus and wrongfully alters haplotype frequencies during haplotype calling with SMAP haplotype-sites. To exclude this phenomenon from downstream analysis, by default, all StackClusters are removed where the lowest upstream SMAP of one Stack is found at a higher reference position than the highest downstream SMAP of an other Stack in the same StackCluster. In addition, MergedClusters are removed that were constructed by two StackClusters that do not overlap and an other StackCluster that links the non-overlapping pair (see also How it works).
../_images/SMAP_scenario%27s_seq_align_A_extra.png

Retaining only the central region of loci

In case the user does not prefer to use alternative read mapping (i.e. SMAPs) as a type of polymorphisms in the haplotype strings, the option --central_region in conjuction with options --min_central_region_length and --max_central_region_length, removes the mapping region polymorphisms on the ‘outside’ of the loci, and trims to the ‘inner’ SMAPs.

../_images/SMAP_delineate_Central_Region.png
This filter retains only the central region defined by the downstream SMAP at the locus start and the upstream SMAP at the locus end for a given MergedCluster. It may be used to select the central region of loci that is covered by all reads in all samples. In addition, loci may be selected within a defined central_region length range. This is instrumental for e.g. HiPlex primer design with the SMAP snp-seq module. In that case, primers are designed in the central region of a locus where all detected reads span the entire locus, whereas the flanking regions that displayed truncated read alignments suggest that neighboring (yet unknown) polymorphisms hampered complete read alignment. By focussing on the central region, all reads are known to span the entire locus in all detected samples (use also in combination with the --completeness filter), and all known SNPs can be taken into account to avoid primer binding sites on polymorphic regions, while targeting SNPs in the central region by SMAP snp-seq.