Scope & Usage

Scope

SMAP haplotype-sites: using polymorphic sites (SNPs, SVs, and/or SMAPs) for read-backed haplotyping

SMAP haplotype-sites reconstructs multi-allelic haplotypes based on a predefined set of polymorphisms at Single Nucleotide Polymorphisms (SNPs), breakpoints of Structural Variants (SVs) and/or Stack Mapping Anchor Points (SMAPs) through read-backed haplotyping.
SMAP haplotype-sites can be used for `stacked´ read data such as Genotyping-By-Sequencing (GBS) or highly multiplex amplicon sequencing (HiPlex), and for random fragmented (e.g. Shotgun Sequencing) read data.
../_images/SMAP_sites_introduction_scheme.png

SMAP haplotype-sites only requires this input:

  1. a single BED file to define the start and end points of loci (loci created by SMAP delineate for GBS, amplicon regions for HiPlex, and Sliding frames for Shotgun sequencing).

  2. a single VCF file containing bi-allelic SNPs obtained with third-party SNP calling software.

  3. a set of indexed BAM files for all samples that need to be compared.

SMAP haplotype-sites performs read-backed haplotyping, per sample, per locus, per read, using positional information of read alignments and creates multi-allelic haplotypes from a short string of polymorphic sites (ShortHaps).
SMAP haplotype-sites takes a conservative approach, without any form of imputation or phase extension, and strictly considers SNPs and/or SMAPs within a read for read-backed haplotyping.
SMAP haplotype-sites filters out genotype calls of loci with low read counts, and low frequency haplotypes, to control for noise in the data.
SMAP haplotype-sites creates a multi-allelic genotype call matrix listing haplotype calls, per sample, per locus, across the sample set.
SMAP haplotype-sites always returns quantitative haplotype frequencies, useful for Pool-Seq data.
SMAP haplotype-sites can also create discrete haplotype calls (expressed as either dominant or dosage calls) for individual samples.
SMAP haplotype-sites plots the haplotype frequency distribution per sample.
SMAP haplotype-sites plots a histogram of the number of haplotypes per locus across the sample set to show the haplotype diversity.

Loci with sets of polymorphic sites

In the SMAP haplotype-sites workflow, the user first selects loci known to be covered by reads across the sample set. For HiPlex data, pairs of primers define locus positions. SNPs identified by third-party software that are located within these loci are combined into haplotypes, all other SNPs and all other non-polymorphic positions are excluded. For Shotgun data, dynamic sliding frames are used that bundle neighboring SNPs, based on a VCF file with known SNPs obtained by third-party software. For GBS data, read mapping polymorphisms (SMAPs, see SMAP delineate) define locus positions and may be combined with SNPs as molecular markers for haplotyping. (See for third-party SNP calling software: SAMtools, BEDtools, Freebayes, or GATK for individuals, or SNAPE-pooled for Pool-Seq data. See also Veeckman et al, 2019 for a comparison of methods).

Integration in the SMAP workflow

../_images/SMAP_global_scheme_home_sites.png

SMAP haplotype-sites is run on a set of BAM files, using BED files with locus positions created by SMAP delineate, SMAP sliding-frames or SMAP design, and a VCF file with SNP variants obtained with third-party software. Genotype call tables created by SMAP haplotype-sites may further be analysed with SMAP grm. SMAP haplotype-sites works on GBS, HiPlex and Shotgun sequencing data.

Commands & options

The scheme below shows how SMAP haplotype-sites is integrated with preprocessing, read mapping, locus delineation, and SNP calling. For GBS data, loci are positioned with SMAP delineate.
../_images/NatMeth_Fig1b.png

Mandatory options for SMAP haplotype-sites

The option -mapping_orientation must always be used to specify if strandedness of read mapping should be considered for haplotyping. -mapping_orientation stranded means that only reads will be considered that map on the same strand as indicated per locus in the SMAP BED file. -mapping_orientation ignore should be used to collect all reads per locus independent of the strand that the reads are mapped on (i.e. ignoring their mapping orientation). See the section on strandedness for more information.
use -mapping_orientation ignore for PE HiPlex reads that were merged before mapping (by e.g. PEAR).
The option -partial must always be used to specify if reads are expected to be aligned at both outer positions of the locus (HiPlex, Shotgun SNPs in sliding frames) or if reads are expected to display read mapping polymorphisms along the locus (GBS, Shotgun SVs).
-partial exclude means that only reads are considered for haplotyping that completely cover the locus including both start and end points. Partially mapped reads are excluded. (see also HiPlex and Shotgun SNPs)
-partial include means that all reads are considered for haplotyping, including reads that only partially cover the locus. (see also GBS and Shotgun SVs)
Use -partial exclude for HiPlex because reads amplified with both primers are expected to cover the entire region between the primers. This is scored by being “present” in the read-reference aligned nucleotide pair on the two SMAP positions (just downstream of the forward primer, and just upstream of the reverse primer). Reads with partial alignments are considered amplification, sequencing, or read trimming artefacts, and are excluded from evaluation in the haplotype tables.

General options

positional arguments:

alignments_dir ############# (str) ### Path to the directory containing BAM and BAI files. All BAM files should be in the same directory. Positional mandatory argument, should be the first argument after smap haplotype-sites [no default].
bed ##################### (str) ### Path to the BED file containing sites for which haplotypes will be reconstructed. For GBS experiments, the BED file should be generated using SMAP delineate. For HiPlex data, a BED6 file can be provided, with the 4th and 5th column being blank and the chromosome name, locus start position site, locus end position site and strand information populating the first, second, third and sixth column respectively. Positional mandatory argument, should be the second argument after smap haplotype-sites.
vcf ##################### (str) ### Path to the VCF file (in VCFv4.2 format) containing variant positions. It should contain at least the first 9 columns listing the SNP positions, sample-specific genotype calls across the sampleset are not required. Positional mandatory argument, should be the third argument after smap haplotype-sites.
optional arguments:

-p, --processes ########### (int) ### Number of parallel processes [1].
--plot ######################### Select which plots are to be generated. Choosing “nothing” disables plot generation. Passing “summary” only generates graphs with information for all samples while “all” will also enable generate per-sample plots [default “summary”].
-t, --plot_type ################## Use this option to choose plot format, choices are png and pdf [png].
-o, --out ############### (str) ### Basename of the output file without extension [SMAP_haplotype_sites].
-u, --undefined_representation ####### Value to use for non-existing or masked data [NaN].
-h, --help ##################### Show the full list of options. Disregards all other parameters.
-v, --version ################### Show the version. Disregards all other parameters.
--debug ######################## Enable verbose logging.

Optional arguments may be given in any order.

Filtering options

-q, --min_mapping_quality #### (int) ### Minimum .bam mapping quality to retain reads for analysis [30].
--no_indels ##################### Use this option if you want to exclude haplotypes that contain an InDel at the given SNP/SMAP positions. These reads are also ignored to evaluate the minimal read count [default off; indels are included in output].
-j, --min_distinct_haplotypes # (int) ### Minimal number of distinct haplotypes per locus across all samples. Loci that do not fit this criterium are removed from the final output [0].
-k, --max_distinct_haplotypes # (int) ### Maximal number of distinct haplotypes per locus across all samples. Loci that do not fit this criterium are removed from the final output [inf].
-c, --min_read_count ####### (int) ### Minimal total number of reads per locus per sample [0].
-d, --max_read_count ####### (int) ### Maximal number of reads per locus per sample, read count is calculated after filtering out the low frequency haplotypes (-f) [inf].
-f, --min_haplotype_frequency # (float) ## Set minimal HF (in %) to retain the haplotype in the genotyping matrix. Haplotypes above this threshold in at least one of the samples are retained. Haplotypes that never reach this threshold in any of the samples are removed [0].
-m, --mask_frequency ####### (float) ## Mask haplotype frequency values below this threshold for individual samples to remove noise from the final output. Haplotype frequency values below this threshold are set to -u. Haplotypes are not removed based on this value, use --min_haplotype_frequency for this purpose instead.

Optional arguments may be given in any order.

Options for discrete calling in individual samples

This option is primarily supported for diploids, triploids and tetraploids. Users can define their own custom frequency interval bounds for species with a higher ploidy, but this requires optimization based on the observed haplotype frequency distributions.

-e, –-discrete_calls ### (str) ### Set to “dominant” to transform haplotype frequency values into presence(1)/absence(0) calls per allele, or “dosage” to indicate the allele copy number.

-i, --frequency_interval_bounds ## Frequency interval bounds for classifying the read frequencies into discrete calls. Custom thresholds can be defined by passing one or more space-separated integers or floats which represent relative frequencies in percentage. For dominant calling, one value should be specified. For dosage calling, an even total number of four or more thresholds should be specified. Defaults are used by passing either “diploid”, “triploid”, or “tetraploid”. The default value for dominant calling (see discrete_calls argument) is 10, regardless if “diploid”, “triploid” or “tetraploid” is used. For dosage calling, the default for diploids is “10 10 90 90”, for triploids “12.5 12.5 50.0 50.0 87.5 87.5”, and for tetraploids “12.5 12.5 37.5 37.5 62.5 62.5 87.5 87.5”.

-z, --dosage_filter #### (int) ### Mask dosage calls in the loci for which the total allele count for a given locus at a given sample differs from the defined value. For example, in diploid organisms the total allele copy number must be 2, in triploid organisms the total dosage call must be 3, and in tetraploids the total allele copy number must be 4 [default no filtering].

--locus_correctness ###### (int) ### Threshold value: % of samples with locus correctness. Create a new .bed file defining only the loci that were correctly dosage called (-z) in at least the defined percentage of samples [default no filtering].

--cervus ################# ### Transform discrete genotype call table to a multi-allelic format that can be used as input for Cervus. Haplotypes are transformed to letters of the alphabet (a-z).

--frequency_interval_bounds in practical examples and additional information on the dosage filter can be found in the section on recommendations.


Example commands

HiPlex

smap haplotype-sites /path/to/BAM/ /path/to/BED/ /path/to/VCF/ -mapping_orientation ignore -partial exclude --no_indels --min_read_count 30 -f 5 -p 8 --min_distinct_haplotypes 2 --plot_type png --plot all -o 2n_pool_HiPlex_NI_NP

Shotgun

smap haplotype-sites /path/to/BAM/ /path/to/BED/ /path/to/VCF/ -mapping_orientation stranded -partial include --min_read_count 10 -f 5 -p 8 --min_distinct_haplotypes 2 --plot_type png --plot all -o 2n_ind_WGS_SE_NI_DOMdiplo --discrete_calls dominant --frequency_interval_bounds 10

GBS

smap haplotype-sites /path/to/BAM/ /path/to/BED/ /path/to/VCF/ -mapping_orientation stranded -partial include --no_indels --min_read_count 30 -f 2 -p 8 --min_distinct_haplotypes 2 --plot_type png --plot all -o 2n_pool_GBS_SE_NI

Summary of Commands

A typical command line example :

smap haplotype-sites /path/to/BAM/ /path/to/BED/ /path/to/VCF/ -mapping_orientation stranded --no_indels -c 10 -f 5 -p 8 --plot_type png -partial include --min_distinct_haplotypes 2 -o haplotypes_SampleSet1

Output

Tabular output

By default, SMAP haplotype-sites will return two .tsv files.

haplotype counts

Read_counts_cx_fx_mx.tsv (with x the value per option used in the analysis) contains the read counts (-c) and haplotype frequency (-f) filtered and/or masked (-m) read counts per haplotype per locus as defined in the BED file from SMAP delineate. This is the file structure:

Locus

Haplotypes

Sample1

Sample2

Sample..

Chr1:100-200_+

00010

0

13

34

Chr1:100-200_+

01000

19

90

28

Chr1:100-200_+

00110

60

0

23

Chr1:450-600_+

0010

70

63

87

Chr1:450-600_+

0110

108

22

134

relative haplotype frequency

Haplotype_frequencies_cx_fx_mx.tsv contains the relative frequency per haplotype per locus in sample (based on the corresponding count table: Read_counts_cx_fx_mx.tsv). The transformation to relative frequency per locus-sample combination inherently normalizes for differences in total number of mapped reads across samples, and differences in amplification efficiency across loci. This is the file structure:

Locus

Haplotypes

Sample1

Sample2

Sample..

Chr1:100-200_+

00010

0

0.13

0.40

Chr1:100-200_+

01000

0.24

0.87

0.33

Chr1:100-200_+

00110

0.76

0

0.27

Chr1:450-600_+

0010

0.39

0.74

0.39

Chr1:450-600_+

0110

0.61

0.26

0.61

Graphical output

haplotype diversity

By default, SMAP haplotype-sites will generate graphical output summarizing haplotype diversity. haplotype_counts_discrete_calls_filtered.barplot.png shows a histogram of the number of distinct haplotypes per locus across all samples.

haplotype frequency distribution per sample

Graphical output of the haplotype frequency distribution for each individual sample can be switched on using the option --plot_all. sample_haplotype_frequency_distribution.png shows the haplotype frequency distribution across all loci detected per sample. It is the graphical representation of each sample-specific column in haplotypes_cx_fx_mx.tsv. Using the option --discrete_calls, this plot will also show the defined discrete calling boundaries.

quality of genotype calls per locus and per sample (only for individuals)

After discrete genotype calling with option --discrete_calls, SMAP haplotype-sites will evaluate the observed sum of discrete dosage calls per locus per sample versus the expected value per locus (set with option -z, recommended use: 2 for diploid, 4 for tetraploid).

The quality of genotype calls per sample is calculated in two ways: the fraction of loci with calls in that sample versus the total number of loci across all samples (sample_call_completeness); the fraction of loci with expected sum of discrete dosage calls (-z) versus the total number of observed loci in that sample (sample_call_correctness.tsv). These scores are calculated separately per sample, and SMAP haplotype-sites plots the distribution of those scores across the sample set (sample_call_completeness.png; sample_call_correctness.png).

Similarly, the quality of genotype calls per locus is calculated in two ways: the fraction of samples with calls for that locus versus the total number of samples (locus_call_completeness); the fraction of samples with expected sum of discrete dosage calls (-z) versus the total number of observed samples for that locus (locus_call_correctness.tsv). These scores are calculated separately per locus, and SMAP haplotype-sites plots the distribution of those scores across the locus set (locus_call_completeness.png; locus_call_correctness.png).

Both graphs and the corresponding tables (one for samples and one for loci) can be evaluated to identify poorly performing samples and/or loci. We recommend to eliminate these from further analysis by removing BAM files from the run directory and/or loci from the SMAP delineate BED file with SMAPs, and iterate through rounds of data analysis combined with sample and locus quality control.