How It Works

The SMAP package was created to resolve a paradox in genotype calling using `Stacked´ Genotyping-By-Sequencing reads mapped onto a reference genome sequence.
The paradox is that GBS reads mapped onto a reference genome sequence are used to identify genetic polymorphisms (SNPs and/or InDels) as molecular markers, yet SNPs and InDels themselves define how and where reads map onto the reference genome sequence, and thus influence their own detection and genotype calling.
Read-backed haplotyping requires the definition of the start and end point of loci to bundle sets of polymorphic `Sites´ into a string of phased markers. Again, SNPs and InDels themselves define which regions on the reference genome sequence are covered by reads, and thus influence the start and end point of their own loci.

The SMAP package resolves this paradox by:

  1. recognizing that read mapping polymorphisms exist.

  2. systematically positioning read mapping polymorphisms in a coordinate system of Stack Mapping Anchor Points (SMAPs) that mark the start and end points of loci covered by reads. This step is called Delineation of loci, and is performed by SMAP delineate.

  3. using the read mapping polymorphism itself as a novel type of molecular marker for haplotype calling, based on the read-reference alignment and read-backed phasing.

Here, we will describe the first two steps (recognizing and systematically positioning SMAPs), while the third step (haplotype calling), is performed by the component SMAP haplotype-sites.
The main goal of SMAP delineate is to analyze GBS read mapping distributions across sample sets for:
  1. fast Quality Control of read preprocessing and mapping procedures before SNP calling.

  2. to select loci relevant for haplotyping. For this purpose, SMAP delineate creates a BED file with SMAP positions per locus that can subsequently be used for read-backed haplotyping using SMAP haplotype-sites.


Delineating Stacks in different types of GBS libraries

The procedure to delineate Stack Mapping Anchor Points (SMAPs) in GBS loci depends on the type of sequencing data (separately mapped reads (single-end, paired-end), or merged reads) and the type of GBS library (single-enzyme GBS or double-enzyme GBS).

Below, we explain the principles underlying SMAP delineate applied to these different GBS scenarios, and illustrate how to recognize and position read mapping polymorphisms. The section Feature description provides full details on the different reasons why polymorphisms (SNPs and Indels) give rise to alternative mapping positions of merged reads, compared to a reference read allele obtained by GBS.

../_images/SD_HIW_overview.png

In single-enzyme GBS, both ends of a given genomic fragment contain the same `sticky end´ (restriction digest overhang). As a consequence, the i5 and i7 Illumina adapters have equal chance to ligate to either end of the fragment. This means that individual copies of the same locus may become ligated in the forward or in the reverse orientation with respect to the i5 or i7 adapters and their corresponding reads may map either onto the + strand or the - strand of the reference genome sequence. This section explains further why the strandedness of the mapping is relevant for delineation of read mapping polymorphisms in separately mapped reads, but not in merged reads.
Separately mapped reads usually span the genomic fragment only partially; from one RE to somewhere internal to the GBS fragment (`open-ended´). In contrast, merged reads usually span the genomic fragment from RE to RE (`closed-ended´).

Step 1. Delineating Stacks

Stacks are defined by sets of reads with identical read mapping start and end positions

Procedure

For each BAM file, reads are filtered on mapping quality (MQ) score, and stacks are delineated by unique start and end positions of each mapped read, here called Stack Mapping Anchor Points (SMAPs). In a BAM file viewed with SAMtools view, these coordinates are listed per read in column 4 and 5. For each Stack, the number of reads with exact same start and end positions (SMAPs) is counted. Stacks are only retained with Stack read depth between user-defined minimum and maximum values.

Reads to Stacks:

../_images/Reads_to_Stacks.png


SMAP summary of command line options

Step 2. Delineating StackClusters

StackClusters are defined by the positional overlap between Stacks within a sample

Procedure

For each BAM file, StackClusters are created by positional overlap (BEDtools merge) between Stacks. For each StackCluster, the number of overlapping Stacks is counted, and the corresponding read depths are summed. Based on user-defined criteria, these StackClusters are then filtered.

The user can set a minimum StackCluster read depth to select loci with sufficient reads for SNP/haplotype calling, and a maximum StackCluster read depth to remove loci likely derived from repetitive sequences. For each Stack, the relative contribution to the total StackCluster read depth is evaluated (Stack Depth Fraction). Stacks are only included in a StackCluster if they constitute at least a user-defined minimum fraction of the total read count in the StackCluster. This option removes Stacks derived from low frequency reads (and corresponding SMAPs) spuriously overlapping with otherwise correct StackClusters. Different threshold values are used for individuals and pools, as low frequency (<5-10%) observations are likely to be noise in genotyping data of individuals, while low frequency observations are expected in Pool-Seq data of genetically diverse species. The user can set a maximum number of Stacks per StackCluster to select loci according to an expected level of genetic diversity. If a StackCluster contains more Stacks than the specified value, the entire StackCluster is discarded from that sample. It is recommended to run SMAP delineate several times on your data while changing this value, as it might uncover possible technical or biological errors. For example, for a diploid individual the majority of StackClusters are expected to contain only 2 Stacks. If many StackClusters contain 3 or 4 Stacks this may indicate contamination or mislabeling of the sample. The number of Stacks per StackCluster are visualized in sample.StackCluster.Stacks.histogram.png, see Example data analyses for examples from actual data.

Stacks to StackClusters:

../_images/Stacks_to_StackClusters.png


SMAP summary of command line options

Step 3. Delineating MergedClusters

MergedClusters are defined by the positional overlap between StackClusters across sample sets

Procedure

After processing all BAM files in the specified folder, SMAP delineate overlaps (BEDtools merge) all StackClusters across all samples, and defines a MergedCluster for each genomic locus that is covered by a StackCluster in at least one sample. For each MergedCluster, the number of overlapping StackClusters is counted and the corresponding read depths are summed. MergedClusters are only retained if they meet two criteria; the respective StackCluster was detected in a minimum percentage of samples across the sample set (completeness), and less than a maximum number of SMAPs per MergedCluster to remove excessively high polymorphic loci.

StackClusters to MergedClusters:

../_images/StackClusters_to_MergedClusters.png


SMAP summary of command line options