Scope & Usage

Scope

SMAP haplotype-sites: using polymorphic sites (SNPs, SVs, and/or SMAPs) for read-backed haplotyping

SMAP haplotype-sites reconstructs multi-allelic haplotypes based on a predefined set of polymorphisms at Single Nucleotide Polymorphisms (SNPs), breakpoints of Structural Variants (SVs) and/or Stack Mapping Anchor Points (SMAPs) through read-backed haplotyping.

SMAP haplotype-sites can be used for `stacked´ read data such as Genotyping-By-Sequencing (GBS) or highly multiplex amplicon sequencing (HiPlex), and for random fragmented (e.g. Shotgun Sequencing) read data.

../_images/SMAP_sites_introduction_scheme.png

SMAP haplotype-sites only requires this input:

a single BED file to define the start and end points of loci (loci created by SMAP delineate for GBS, amplicon regions for HiPlex, and Sliding frames for Shotgun sequencing).

a single VCF file containing bi-allelic SNPs obtained with third-party SNP calling software.

a set of indexed BAM files for all samples that need to be compared.

SMAP haplotype-sites performs read-backed haplotyping, per sample, per locus, per read, using positional information of read alignments and creates multi-allelic haplotypes from a short string of polymorphic sites (ShortHaps).
SMAP haplotype-sites takes a conservative approach, without any form of imputation or phase extension, and strictly considers SNPs and/or SMAPs within a read for read-backed haplotyping.
SMAP haplotype-sites filters out genotype calls of loci with low read counts, and low frequency haplotypes, to control for noise in the data.
SMAP haplotype-sites creates a multi-allelic genotype call matrix listing haplotype calls, per sample, per locus, across the sample set.
SMAP haplotype-sites always returns quantitative haplotype frequencies, useful for Pool-Seq data.
SMAP haplotype-sites can also create discrete haplotype calls (expressed as either dominant or dosage calls) for individual samples.
SMAP haplotype-sites plots the haplotype frequency distribution per sample.
SMAP haplotype-sites plots a histogram of the number of haplotypes per locus across the sample set to show the haplotype diversity.

Loci with sets of polymorphic sites

In the SMAP haplotype-sites workflow, the user first selects loci known to be covered by reads across the sample set. For HiPlex data, pairs of primers define locus positions. SNPs identified by third-party software that are located within these loci are combined into haplotypes, all other SNPs and all other non-polymorphic positions are excluded. For Shotgun data, dynamic sliding frames are used that bundle neighboring SNPs, based on a VCF file with known SNPs obtained by third-party software. For GBS data, read mapping polymorphisms (SMAPs, see SMAP delineate) define locus positions and may be combined with SNPs as molecular markers for haplotyping. (See for third-party SNP calling software: SAMtools, BEDtools, Freebayes, or GATK for individuals, or SNAPE-pooled for Pool-Seq data. See also Veeckman et al, 2019 for a comparison of methods).

Integration in the SMAP workflow

../_images/SMAP_global_scheme_home_sites.png

SMAP haplotype-sites is run on a set of BAM files, using BED files with locus positions created by SMAP delineate, SMAP sliding-frames or SMAP design, and a VCF file with SNP variants obtained with third-party software. Genotype call tables created by SMAP haplotype-sites may further be analysed with SMAP grm. SMAP haplotype-sites works on GBS, HiPlex and Shotgun sequencing data.

Commands & options

The scheme below shows how SMAP haplotype-sites is integrated with preprocessing, read mapping, locus delineation, and SNP calling. For GBS data, loci are positioned with SMAP delineate.

Depending on the type of data (HiPlex, Shotgun, or GBS), a specific BED file must be created to define the start and end positions of loci.

Typical Primer3 output that needs to be converted to a BED file to delineate the loci for SMAP haplotype-sites.

Index	Seq ID	Count	Primer_type	Orientation	Start	Len	tm	GC%	Any compl	3’ compl	Seq	Prod Size	Seq Length	Included Length	Pair any compl	Pair 3’ compl
1	Chr1	1	Generic	FORWARD	2	16	58.72	56.25	5.00	0.00	ATTCTCCGGGGTCACT	72	29887145	29887145	6.00	3.00
2	Chr1	1	Generic	REVERSE	73	17	59.69	47.05	4.00	2.00	GTACACCGGTATTCTTC
3	Chr1	1	Generic	FORWARD	92	20	59.65	45.00	3.00	3.00	CCCAAAAATCCCAGTGACAT	83	29887145	29887145	3.00	1.00
4	Chr1	1	Generic	REVERSE	174	20	58.88	55.00	3.00	0.00	TGACAGTAGCCCAAGAGGTG
5	Chr1	1	Generic	FORWARD	294	20	60.01	60.00	4.00	0.00	GCTAGTGGGAGCTGAAGTGG	81	29887145	29887145	3.00	1.00
6	Chr1	1	Generic	REVERSE	374	20	60.28	50.00	4.00	2.00	TAGTGCTGGCAACGACCATA
7	Chr1	1	Generic	FORWARD	463	20	60.79	60.00	6.00	0.00	GCTGCAGGGTAAGGAGAGGT	84	29887145	29887145	5.00	1.00
8	Chr1	1	Generic	REVERSE	546	21	59.00	47.62	8.00	2.00	GGATATCCTTGTCGAACTCCA

The scheme below outlines the relative positions of primers and loci on the reference genome sequence.

../_images/coordinates_HiPlex_manual.png

For HiPlex data, the user needs to create a custom BED file listing the loci based on the primer binding sites. We recommend to keep primer sequences in HiPlex reads for mapping, but to define the region between the primers in the BED file used for SMAP haplotype-sites. This region is defined by the first nucleotide downstream of the forward primer binding site to the last nucleotide upstream of the reverse primer binding site.

The primer binding site coordinates (using GFF coordinate system for primers: start and end both 1-based) need to be transformed as follows:

BED	INPUT
Reference	reference sequence ID
Start	F-primer end position (F-primer end given as 1-based coordinate)
End	R-primer start position - 1 (R-primer start given as 1-based coordinate)
HiPlex_locus_name	reference_(F-primer end position + 1)_(R-primer start position - 1)
Mean_Read_Depth	.
Strand	+
SMAPs	(F-primer end position + 1), (R-primer start position - 1)
Completeness	.
nr_SMAPs	2
Name	HiPlex_Set1

The table below corresponds to the four loci defined by the Primer3 output shown above.

Reference	Start	End	HiPlex_locus_name	Mean_read_depth	Strand	SMAPs	Completeness	nr_SMAPs	Name
Chr1	17	56	Chr1:18-56_+	.	+	18,56	.	2	HiPlex_Set1
Chr1	111	164	Chr1:112-164_+	.	+	112,164	.	2	HiPlex_Set1
Chr1	313	354	Chr1:314-354_+	.	+	314,354	.	2	HiPlex_Set1
Chr1	482	525	Chr1:483-525_+	.	+	483,525	.	2	HiPlex_Set1

Consider the following read mapping and associated VCF file with several neighboring SNPs.

../_images/coordinates_Shotgun_SNPs_manual.png

The user needs to create a custom BED file listing the loci based on a VCF file with SNPs. Sliding frames are created starting from the first SNP in the sequence, We recommend to define 3bp Sliding frames with the central nucleotide at the junction and two flanking nucleotides as SMAPs in the BED file used for SMAP haplotype-sites. Each junction on both ends of a structural variant may be genotyped independently.

Reference	Start	End	HiPlex_locus_name	Mean_read_depth	Strand	SMAPs	Completeness	nr_SMAPs	Name
Chr1	16	32	Chr1:17-32_+	.	+	17,32	.	2	HiPlex_Set1
Chr1	39	56	Chr1:40-56_+	.	+	40,56	.	2	HiPlex_Set1
Chr1	107	108	Chr1:108-108_+	.	+	108,108	.	2	HiPlex_Set1

The SNP coordinates need to be transformed into sliding frames as follows:

BED	INPUT
Reference	reference sequence ID
Start	first SNP position in frame - offset - 1
End	last SNP position in frame + offset
Shotgun_locus_name	reference_start_end
Mean_Read_Depth	.
Strand	+
SMAPs	First SNP position - offset, last SNP position + Offset
Completeness	.
nr_SMAPs	2
Name	Shotgun_Set1

Consider the following read mapping structure and associated VCF file with structural variants.

../_images/coordinates_Shotgun_SV_manual.png

The user needs to create a custom BED file listing the loci based on a VCF file with known junctions of Stuctural Variants. We recommend to define 3bp Sliding frames with the central nucleotide at the junction and two flanking nucleotides as SMAPs in the BED file used for SMAP haplotype-sites. Each junction on both ends of a structural variant may be genotyped independently.

Reference	Start	End	HiPlex_locus_name	Mean_read_depth	Strand	SMAPs	Completeness	nr_SMAPs	Name
Chr1	42	45	Chr1:43-45_+	.	+	43,45	.	2	Shotgun_Set2
Chr1	193	196	Chr1:194-196_+	.	+	194,196	.	2	Shotgun_Set2
Chr1	10038	10041	Chr1:10039-10041_+	.	+	10039,10041	.	2	Shotgun_Set2

The SV coordinates need to be transformed to short Sliding frames as follows:

BED	INPUT
Reference	reference sequence ID
Start	SV position - 2
End	SV position + 1
Shotgun_locus_name	reference_(SV position - 1)_(SV position + 1)
Mean_Read_Depth	.
Strand	+
SMAPs	(SV position - 1), (SV position + 1)
Completeness	.
nr_SMAPs	2
Name	Shotgun_Set2

For GBS data, the user needs to run SMAP delineate on the same set of BAM files as will be used for haplotyping to create a BED file listing the loci with SMAPs. The read mapping profiles determine the locus start and end points and internal SMAPs.

Reference	Start	End	MergedCluster_name	Mean_read_depth	Strand	SMAPs	Completeness	nr_SMAPs	Name
scaffold_10030	15617	15711	scaffold_10030:15618-15711_+	1899	+	15618,15622,15703,15711	13	4	2n_ind_GBS_SE
scaffold_10030	15712	15798	“scaffold_10030:15713-15798_-”	1930	-	15713,15793,15798	9	3	2n_ind_GBS_SE

BED file entry listing all relevant features of two neighboring loci. On the + strand of the reference sequence, the start (15617) and end (15711) positions of the locus, together with the mean locus read depth (1899), the strand (+), the internal SMAP positions (15621, 15702), the number of samples with data at that locus (completeness, 13), the number of SMAPs (4), and a custom label that denotes the dataset (2n_ind_GBS_SE). The second entry lists the locus and SMAP positions on the (-) strand.

SMAP haplotype-sites reconstructs haplotypes based on SMAP positions and SNPs through read-backed haplotyping on a given set of BAM files.
SMAP haplotype-sites first creates sets of polymorphic positions per locus on the reference genome by intersecting locus regions (obtained with SMAP delineate) with a VCF file containing selected SNPs (obtained from any third-party SNP calling algorithm applied to the same set of BAM files).
Read-reference nucleotide pairs are retrieved by pysam ‘s get_aligned_pairs function, in which lower case nucleotides denote "different from the reference”.
In each BAM file, SMAP haplotype-sites then evaluates each read-reference alignment for the nucleotide aligned at the SMAP/SNP positions and scores as follows:

CALL TYPE	CLASSES
.	absence of read mapping
0	presence of the reference nucleotide
1	presence of an alternative nucleotide (any nucleotide different from the reference)
-	presence of a gap in the alignment

These calls are concatenated into a haplotype string of '.01-’s. For each discovered haplotype in the data, the total number of corresponding reads is counted per sample. Next, the haplotype counts of all samples are integrated into one master table, and expressed as relative haplotype frequency per locus per sample. Haplotypes with low frequency across all samples are removed to control for noise. The final table with haplotype frequencies per locus per sample is the end point for analysis of Pool-Seq data. Using the option --discrete_calls, SMAP haplotype-sites transforms the haplotype frequency table into discrete haplotype calls for individuals.

Three modes may be chosen for discrete haplotype calling in individuals:

CALL TYPE	CLASSES
dosage calls in diploids	0, 1, 2
dosage calls in triploids	0, 1, 2, 3
dosage calls in tetraploids	0, 1, 2, 3, 4
dosage calls in pentaploids	0, 1, 2, 3, 4, 5
dosage calls in pentaploids	0, 1, 2, 3, 4, 5, 6
dominant calls	0, 1

In the following sections, identification and quantification of haplotypes is illustrated on single-end GBS read data of a set of 8 diploid individuals at two partially overlapping loci. The content of the three example input files (BED, VCF, BAM) at this locus will be used to demonstrate the subsequent steps of SMAP haplotype-sites.

Mandatory options for SMAP haplotype-sites

The option -mapping_orientation must always be used to specify if strandedness of read mapping should be considered for haplotyping. -mapping_orientation stranded means that only reads will be considered that map on the same strand as indicated per locus in the SMAP BED file. -mapping_orientation ignore should be used to collect all reads per locus independent of the strand that the reads are mapped on (i.e. ignoring their mapping orientation). See the section on strandedness for more information.

use -mapping_orientation ignore for PE HiPlex reads that were merged before mapping (by e.g. PEAR).

use -mapping_orientation ignore for single-end, paired-end reads mapped separately, or reads that were merged before mapping (by e.g. PEAR).

use -mapping_orientation stranded for single-end or paired-end reads mapped separately.

use -mapping_orientation ignore for reads that were merged before mapping (by e.g. PEAR).

The option -partial must always be used to specify if reads are expected to be aligned at both outer positions of the locus (HiPlex, Shotgun SNPs in sliding frames) or if reads are expected to display read mapping polymorphisms along the locus (GBS, Shotgun SVs).
-partial exclude means that only reads are considered for haplotyping that completely cover the locus including both start and end points. Partially mapped reads are excluded. (see also HiPlex and Shotgun SNPs)
-partial include means that all reads are considered for haplotyping, including reads that only partially cover the locus. (see also GBS and Shotgun SVs)

Use -partial exclude for HiPlex because reads amplified with both primers are expected to cover the entire region between the primers. This is scored by being “present” in the read-reference aligned nucleotide pair on the two SMAP positions (just downstream of the forward primer, and just upstream of the reverse primer). Reads with partial alignments are considered amplification, sequencing, or read trimming artefacts, and are excluded from evaluation in the haplotype tables.

Use -partial exclude for Shotgun if sliding frames are used to haplotype sets of neighboring SNPs.

Use -partial include for Shotgun if sliding frames are used to haplotype the junctions of SVs.

Use -partial include for GBS because you must include reads with mapping position polymorphisms in the haplotype table.

General options

positional arguments:

alignments_dir ############# (str) ### Path to the directory containing BAM and BAI files. All BAM files should be in the same directory. Positional mandatory argument, should be the first argument after smap haplotype-sites [no default].
bed ##################### (str) ### Path to the BED file containing sites for which haplotypes will be reconstructed. For GBS experiments, the BED file should be generated using SMAP delineate. For HiPlex data, a BED6 file can be provided, with the 4th and 5th column being blank and the chromosome name, locus start position site, locus end position site and strand information populating the first, second, third and sixth column respectively. Positional mandatory argument, should be the second argument after smap haplotype-sites.
vcf ##################### (str) ### Path to the VCF file (in VCFv4.2 format) containing variant positions. It should contain at least the first 9 columns listing the SNP positions, sample-specific genotype calls across the sampleset are not required. Positional mandatory argument, should be the third argument after smap haplotype-sites.

optional arguments:

-p, --processes ########### (int) ### Number of parallel processes [1].
--plot ######################### Select which plots are to be generated. Choosing “nothing” disables plot generation. Passing “summary” only generates graphs with information for all samples while “all” will also enable generate per-sample plots [default “summary”].
-t, --plot_type ################## Use this option to choose plot format, choices are png and pdf [png].
-o, --out ############### (str) ### Basename of the output file without extension [SMAP_haplotype_sites].
-u, --undefined_representation ####### Value to use for non-existing or masked data [NaN].
-h, --help ##################### Show the full list of options. Disregards all other parameters.
-v, --version ################### Show the version. Disregards all other parameters.
--debug ######################## Enable verbose logging.

Optional arguments may be given in any order.

Filtering options

-q, --min_mapping_quality #### (int) ### Minimum .bam mapping quality to retain reads for analysis [30].
--no_indels ##################### Use this option if you want to exclude haplotypes that contain an InDel at the given SNP/SMAP positions. These reads are also ignored to evaluate the minimal read count [default off; indels are included in output].
-j, --min_distinct_haplotypes # (int) ### Minimal number of distinct haplotypes per locus across all samples. Loci that do not fit this criterium are removed from the final output [0].
-k, --max_distinct_haplotypes # (int) ### Maximal number of distinct haplotypes per locus across all samples. Loci that do not fit this criterium are removed from the final output [inf].
-c, --min_read_count ####### (int) ### Minimal total number of reads per locus per sample [0].
-d, --max_read_count ####### (int) ### Maximal number of reads per locus per sample, read count is calculated after filtering out the low frequency haplotypes (-f) [inf].
-f, --min_haplotype_frequency # (float) ## Set minimal HF (in %) to retain the haplotype in the genotyping matrix. Haplotypes above this threshold in at least one of the samples are retained. Haplotypes that never reach this threshold in any of the samples are removed [0].
-m, --mask_frequency ####### (float) ## Mask haplotype frequency values below this threshold for individual samples to remove noise from the final output. Haplotype frequency values below this threshold are set to -u. Haplotypes are not removed based on this value, use --min_haplotype_frequency for this purpose instead.

Optional arguments may be given in any order.

Options for discrete calling in individual samples

This option is primarily supported for diploids, triploids and tetraploids. Users can define their own custom frequency interval bounds for species with a higher ploidy, but this requires optimization based on the observed haplotype frequency distributions.

-e, –-discrete_calls ### (str) ### Set to “dominant” to transform haplotype frequency values into presence(1)/absence(0) calls per allele, or “dosage” to indicate the allele copy number.

-i, --frequency_interval_bounds ## Frequency interval bounds for classifying the read frequencies into discrete calls. Custom thresholds can be defined by passing one or more space-separated integers or floats which represent relative frequencies in percentage. For dominant calling, one value should be specified. For dosage calling, an even total number of four or more thresholds should be specified. Defaults are used by passing either “diploid”, “triploid”, or “tetraploid”. The default value for dominant calling (see discrete_calls argument) is 10, regardless if “diploid”, “triploid” or “tetraploid” is used. For dosage calling, the default for diploids is “10 10 90 90”, for triploids “12.5 12.5 50.0 50.0 87.5 87.5”, and for tetraploids “12.5 12.5 37.5 37.5 62.5 62.5 87.5 87.5”.

-z, --dosage_filter #### (int) ### Mask dosage calls in the loci for which the total allele count for a given locus at a given sample differs from the defined value. For example, in diploid organisms the total allele copy number must be 2, in triploid organisms the total dosage call must be 3, and in tetraploids the total allele copy number must be 4 [default no filtering].

--locus_correctness ###### (int) ### Threshold value: % of samples with locus correctness. Create a new .bed file defining only the loci that were correctly dosage called (-z) in at least the defined percentage of samples [default no filtering].

--cervus ################# ### Transform discrete genotype call table to a multi-allelic format that can be used as input for Cervus. Haplotypes are transformed to letters of the alphabet (a-z).

--frequency_interval_bounds in practical examples and additional information on the dosage filter can be found in the section on recommendations.

Example commands

HiPlex

smap haplotype-sites /path/to/BAM/ /path/to/BED/ /path/to/VCF/ -mapping_orientation ignore -partial exclude --no_indels --min_read_count 30 -f 5 -p 8 --min_distinct_haplotypes 2 --plot_type png --plot all -o 2n_pool_HiPlex_NI_NP

smap haplotype-sites /path/to/BAM/ /path/to/BED/ /path/to/VCF/ -mapping_orientation ignore -partial exclude --min_read_count 10 -f 5 -p 8 --min_distinct_haplotypes 2 --plot_type png --plot all -o 2n_ind_HiPlex_NI_NP_DOMdiplo --discrete_calls dominant --frequency_interval_bounds 10

smap haplotype-sites /path/to/BAM/ /path/to/BED/ /path/to/VCF/ -mapping_orientation ignore -partial exclude --no_indels --min_read_count 10 -f 5 -p 8 --min_distinct_haplotypes 2 --plot_type png --plot all -o 2n_ind_HiPlex_NI_NP_DOSdiplo --discrete_calls dosage --frequency_interval_bounds 10 10 90 90 --dosage_filter 2

smap haplotype-sites /path/to/BAM/ /path/to/BED/ /path/to/VCF/ -mapping_orientation ignore -partial exclude --no_indels --min_read_count 15 -f 5 -p 8 --min_distinct_haplotypes 2 --plot_type png --plot all -o 2n_ind_HiPlex_NI_NP_DOMtriplo --discrete_calls dominant --frequency_interval_bounds 10

smap haplotype-sites /path/to/BAM/ /path/to/BED/ /path/to/VCF/ -mapping_orientation ignore -partial exclude --no_indels --min_read_count 15 -f 5 -p 8 --min_distinct_haplotypes 2 --plot_type png --plot all -o 2n_ind_HiPlex_NI_NP_DOStriplo --discrete_calls dosage --frequency_interval_bounds 12.5 12.5 50.0 50.0 87.5 87.5 --dosage_filter 3

smap haplotype-sites /path/to/BAM/ /path/to/BED/ /path/to/VCF/ -mapping_orientation ignore -partial exclude --no_indels --discrete_calls dominant --frequency_interval_bounds 10 --min_read_count 20 -f 5 -p 8 --min_distinct_haplotypes 2 --plot_type png --plot all -o 4n_ind__NI_NP_DOMtetra

smap haplotype-sites /path/to/BAM/ /path/to/BED/ /path/to/VCF/ -mapping_orientation ignore -partial exclude --no_indels --discrete_calls dosage --frequency_interval_bounds 12.5 12.5 37.5 37.5 62.5 62.5 87.5 87.5 --dosage_filter 4 --min_read_count 20 -f 5 -p 8 --min_distinct_haplotypes 2 --plot_type png --plot all -o 4n_ind__NI_NP_DOStetra

Shotgun

smap haplotype-sites /path/to/BAM/ /path/to/BED/ /path/to/VCF/ -mapping_orientation stranded -partial include --min_read_count 10 -f 5 -p 8 --min_distinct_haplotypes 2 --plot_type png --plot all -o 2n_ind_WGS_SE_NI_DOMdiplo --discrete_calls dominant --frequency_interval_bounds 10

smap haplotype-sites /path/to/BAM/ /path/to/BED/ /path/to/VCF/ -mapping_orientation stranded -partial include --min_read_count 10 -f 5 -p 8 --min_distinct_haplotypes 2 --plot_type png --plot all -o 2n_ind_WGS_SE_NI_DOSdiplo --discrete_calls dosage --frequency_interval_bounds diploid

GBS

smap haplotype-sites /path/to/BAM/ /path/to/BED/ /path/to/VCF/ -mapping_orientation stranded -partial include --no_indels --min_read_count 30 -f 2 -p 8 --min_distinct_haplotypes 2 --plot_type png --plot all -o 2n_pool_GBS_SE_NI

smap haplotype-sites /path/to/BAM/ /path/to/BED/ /path/to/VCF/ -mapping_orientation ignore -partial include --no_indels --min_read_count 30 -f 2 -p 8 --min_distinct_haplotypes 2 --plot_type png --plot all -o 2n_pools_GBS_PE_NI

smap haplotype-sites /path/to/BAM/ /path/to/BED/ /path/to/VCF/ -mapping_orientation ignore -partial include --no_indels --min_read_count 30 -f 2 -p 8 --min_distinct_haplotypes 2 --plot_type png --plot all -o 4n_pools_GBS_PE_NI

smap haplotype-sites /path/to/BAM/ /path/to/BED/ /path/to/VCF/ -mapping_orientation stranded -partial include --no_indels --min_read_count 10 -f 5 -p 8 --min_distinct_haplotypes 2 --plot_type png --plot all -o 2n_ind_GBS_SE_NI_DOSdiplo --discrete_calls dosage --frequency_interval_bounds 10 10 90 90 --dosage_filter 2

smap haplotype-sites /path/to/BAM/ /path/to/BED/ /path/to/VCF/ -mapping_orientation ignore -partial include --no_indels --min_read_count 10 -f 5 -p 8 --min_distinct_haplotypes 2 --plot_type png --plot all -o 2n_ind_GBS_PE_NI_DOMdiplo --discrete_calls dominant --frequency_interval_bounds 10

smap haplotype-sites /path/to/BAM/ /path/to/BED/ /path/to/VCF/ -mapping_orientation ignore -partial include --no_indels --min_read_count 10 -f 5 -p 8 --min_distinct_haplotypes 2 --plot_type png --plot all -o 2n_ind_GBS_PE_NI_DOSdiplo --discrete_calls dosage --frequency_interval_bounds 10 10 90 90 --dosage_filter 2

smap haplotype-sites /path/to/BAM/ /path/to/BED/ /path/to/VCF/ -mapping_orientation ignore -partial include --no_indels --discrete_calls dominant --frequency_interval_bounds 10 --min_read_count 10 -f 5 -p 8 --min_distinct_haplotypes 2 --plot_type png --plot all -o 4n_ind_GBS_PE_NI_DOMtetra

smap haplotype-sites /path/to/BAM/ /path/to/BED/ /path/to/VCF/ -mapping_orientation ignore -partial include --no_indels --discrete_calls dosage --frequency_interval_bounds 12.5 12.5 37.5 37.5 62.5 62.5 87.5 87.5 --dosage_filter 4 --min_read_count 10 -f 5 -p 8 --min_distinct_haplotypes 2 --plot_type png --plot all -o 4n_ind_GBS_PE_NI_DOStetra

Summary of Commands

A typical command line example :

smap haplotype-sites /path/to/BAM/ /path/to/BED/ /path/to/VCF/ -mapping_orientation stranded --no_indels -c 10 -f 5 -p 8 --plot_type png -partial include --min_distinct_haplotypes 2 -o haplotypes_SampleSet1

Output

Tabular output

By default, SMAP haplotype-sites will return two .tsv files.

haplotype counts

Read_counts_cx_fx_mx.tsv (with x the value per option used in the analysis) contains the read counts (-c) and haplotype frequency (-f) filtered and/or masked (-m) read counts per haplotype per locus as defined in the BED file from SMAP delineate. This is the file structure:

Locus

Haplotypes

Sample1

Sample2

Sample..

Chr1:100-200_+

00010

0

13

34

Chr1:100-200_+

01000

19

90

28

Chr1:100-200_+

00110

60

0

23

Chr1:450-600_+

0010

70

63

87

Chr1:450-600_+

0110

108

22

134

relative haplotype frequency

Haplotype_frequencies_cx_fx_mx.tsv contains the relative frequency per haplotype per locus in sample (based on the corresponding count table: Read_counts_cx_fx_mx.tsv). The transformation to relative frequency per locus-sample combination inherently normalizes for differences in total number of mapped reads across samples, and differences in amplification efficiency across loci. This is the file structure:

Locus

Haplotypes

Sample1

Sample2

Sample..

Chr1:100-200_+

00010

0

0.13

0.40

Chr1:100-200_+

01000

0.24

0.87

0.33

Chr1:100-200_+

00110

0.76

0

0.27

Chr1:450-600_+

0010

0.39

0.74

0.39

Chr1:450-600_+

0110

0.61

0.26

0.61

For individuals, if the option --discrete_calls is used, the program will return three additional .tsv files. Their content and order of creation is shown in this scheme.

haplotype total discrete calls

The first file is called haplotypes_cx_fx_mx_discrete_calls._total.tsv and this file contains the total dosage calls, obtained after transforming haplotype frequencies into discrete calls, using the defined --frequency_interval_bounds. The total sum of discrete dosage calls is expected to be 2 in diploids and 4 in tetraploids.

Locus

Sample1

Sample2

Sample..

Chr1:100-200_+

2

2

3

Chr1:450-600_+

2

2

2

haplotype discrete calls

The second file is haplotypes_cx_fx_mx-discrete_calls_filtered.tsv, which lists the discrete calls per locus per sample after --dosage_filter has removed loci per sample with an unexpected number of haplotype calls (as listed in haplotypes_cx_fx_mx_discrete_calls_total.tsv). The expected number of calls is set with option -z [use 2 for diploids, 3 for triploids, 4 for tetraploids].

Locus

Haplotypes

Sample1

Sample2

Sample..

Chr1:100-200_+

00010

0

1

NA

Chr1:100-200_+

01000

1

1

NA

Chr1:100-200_+

00110

1

0

NA

Chr1:450-600_+

0010

1

1

1

Chr1:450-600_+

0110

1

1

1

population haplotype frequencies

The third file, haplotypes_cx_fx_mx_Pop_HF.tsv, lists the population haplotype frequencies (over all individual samples) based on the total number of discrete haplotype calls relative to the total number of calls per locus.

Locus

Haplotypes

Pop_HF

count

Chr1:100-200_+

00010

25.0

4

Chr1:100-200_+

01000

50.0

4

Chr1:100-200_+

00110

25.0

4

Chr1:450-600_+

0010

50.0

6

Chr1:450-600_+

0110

50.0

6

For individuals, if the option --locus_correctness is used in combination with --discrete_calls and --frequency_interval_bounds, the programm will create a new .bed file haplotypes_cx_fx_mx_correctnessx_loci.bed (loci filtered from the input .bed file) containing only the loci that were correctly dosage called (-z) in at least the defined percentage of samples. See above.

Loci with correct calls across the sample set

Reference

Start

End

HiPlex_locus_name

Mean_read_depth

Strand

SMAPs

Completeness

nr_SMAPs

Name

Chr1

99

200

Chr1:100-200_+

.

+

100,200

.

2

HiPlex_Set1

Chr1

449

600

Chr1:450-600_+

.

+

450,600

.

2

HiPlex_Set1

Graphical output

haplotype diversity

By default, SMAP haplotype-sites will generate graphical output summarizing haplotype diversity. haplotype_counts_discrete_calls_filtered.barplot.png shows a histogram of the number of distinct haplotypes per locus across all samples.

haplotype frequency distribution per sample

Graphical output of the haplotype frequency distribution for each individual sample can be switched on using the option --plot_all. sample_haplotype_frequency_distribution.png shows the haplotype frequency distribution across all loci detected per sample. It is the graphical representation of each sample-specific column in haplotypes_cx_fx_mx.tsv. Using the option --discrete_calls, this plot will also show the defined discrete calling boundaries.

quality of genotype calls per locus and per sample (only for individuals)

After discrete genotype calling with option --discrete_calls, SMAP haplotype-sites will evaluate the observed sum of discrete dosage calls per locus per sample versus the expected value per locus (set with option -z, recommended use: 2 for diploid, 4 for tetraploid).

The quality of genotype calls per sample is calculated in two ways: the fraction of loci with calls in that sample versus the total number of loci across all samples (sample_call_completeness); the fraction of loci with expected sum of discrete dosage calls (-z) versus the total number of observed loci in that sample (sample_call_correctness.tsv). These scores are calculated separately per sample, and SMAP haplotype-sites plots the distribution of those scores across the sample set (sample_call_completeness.png; sample_call_correctness.png).

Similarly, the quality of genotype calls per locus is calculated in two ways: the fraction of samples with calls for that locus versus the total number of samples (locus_call_completeness); the fraction of samples with expected sum of discrete dosage calls (-z) versus the total number of observed samples for that locus (locus_call_correctness.tsv). These scores are calculated separately per locus, and SMAP haplotype-sites plots the distribution of those scores across the locus set (locus_call_completeness.png; locus_call_correctness.png).

Both graphs and the corresponding tables (one for samples and one for loci) can be evaluated to identify poorly performing samples and/or loci. We recommend to eliminate these from further analysis by removing BAM files from the run directory and/or loci from the SMAP delineate BED file with SMAPs, and iterate through rounds of data analysis combined with sample and locus quality control.

../_images/sample_call_completeness_correctness_40canephora.png

The sample call completeness plot shows the percentage of loci that have data across the samples after all filters. In read depth-saturated, low diversity datasets, the majority of samples should have high locus completeness and there should not be much variation in completeness between samples. In a high diversity or read depth-unsaturated sample set, locus completeness per sample will be lower and more spread out.

The sample call correctness plot displays the percentage of correctly dosage called (-z) loci across the sampleset. Loci are only masked in samples with a dosage value different from -z but remain in the data set for all other samples with the expected dosage value.

../_images/locus_call_completeness_correctness_40canephora.png

The locus call completeness plot displays the percentage of samples that have data (after every filter) on a locus for every locus. In read depth-saturated, low diversity sample sets, the majority of samples should have many high completeness loci and few low completeness loci. In a high diversity or read depth-unsaturated sample set, many loci will have a low completeness.

The locus call correctness plot shows the percentage of samples that were correctly dosage called (-z) across the locus set. Loci with low correctness values indicate potential genotype calling artefacts and should be removed from the data set.

##fileformat=VCFv4.2
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
scaffold_10030	15623	.	G	T	68888.7	.	.	GT
scaffold_10030	15650	.	C	T	1097.13	.	.	GT
scaffold_10030	15655	.	A	T	1097.13	.	.	GT
scaffold_10030	15682	.	C	G	1097.13	.	.	GT
scaffold_10030	15689	.	T	C	1097.13	.	.	GT
scaffold_10030	15700	.	A	C	1097.13	.	.	GT
scaffold_10030	15704	.	G	T	1097.13	.	.	GT
scaffold_10030	15705	.	A	C	1097.13	.	.	GT
scaffold_10030	15733	.	C	T	45538.80	.	.	GT
scaffold_10030	15753	.	G	C	44581.50	.	.	GT
scaffold_10030	15769	.	C	A	64858.50	.	.	GT
scaffold_10030	15787	.	A	C	67454.00	.	.	GT
scaffold_10030	15796	.	A	C	45281.60	.	.	GT