How It Works

SMAP grm creates pairwise comparisons of samples and checks the overlap of haplotypes per locus, across all loci with data in both samples (here named “shared loci”). A haplotype is unique to a given sample if it is not observed in the other sample of the pair. A locus is counted as ‘unique’ if it contains at least one haplotype that is unique to one of the two samples.

Sample selection, sample names, and sample order

By default, genetic similarities are calculated between all pairs of samples. Sample names in each output file correspond to the names in the haplotype call table, and the order of the sample names in the matrices corresponds to the order in the haplotype call table. However, a custom list containing a subset of sample names can be passed to the script as a tab-delimited text file via the --samples option to limit the analyses to a subset of samples. The order of the sample names in the sample_list.txt file defines the order of the samples in all created matrices: first sample = top (vertical axis) and left (horizontal axis) position, last sample = bottom (vertical axis) and right (horizontal axis) position. Sample names must exactly match the sample names in the haplotype call table. Users can also choose to rename (some) samples by adding a second column with new sample names to the sample_list.txt file. A sample is not renamed if the cell in the second column of the corresponding row is left blank. Two or more samples are merged into one sample if the corresponding cells in the second column contain the same new sample name. The values of the merged sample used for all pairwise calculations are the sums of the discrete haplotype calls or haplotype frequencies of the separate samples. Check out the required input files for examples of sample lists.
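As a rough illustration of the renaming and merging rules above, the following Python sketch parses a two-column sample list and sums the calls of samples that share a new name. It is not SMAP's own code: the function names (read_sample_list, apply_sample_list), the example calls, and the new name "Pool" are hypothetical, and the handling of missing values during merging is left out.

```python
import csv
import io


def read_sample_list(text):
    """Return ordered (old_name, new_name) pairs from tab-delimited text.

    A blank or absent second column means the sample keeps its name.
    """
    pairs = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip empty lines
        old = row[0].strip()
        new = row[1].strip() if len(row) > 1 and row[1].strip() else old
        pairs.append((old, new))
    return pairs


def apply_sample_list(calls, pairs):
    """Select, reorder, rename and merge samples.

    calls maps a sample name to a list of haplotype calls (no missing data
    in this sketch). Samples that receive the same new name are summed.
    """
    merged = {}
    for old, new in pairs:
        if new in merged:
            merged[new] = [x + y for x, y in zip(merged[new], calls[old])]
        else:
            merged[new] = list(calls[old])
    return merged


# Hypothetical calls for three haplotypes in three samples.
calls = {"Sample1": [0, 1, 1], "Sample2": [0, 1, 0], "Sample3": [1, 0, 1]}
# Sample3 keeps its name and comes first; Sample1 and Sample2 are merged.
sample_list = "Sample3\t\nSample1\tPool\nSample2\tPool\n"
print(apply_sample_list(calls, read_sample_list(sample_list)))
```

Because Python dictionaries preserve insertion order, the order of the rows in the list carries over to the order of the resulting samples, mirroring how the list order defines the matrix order.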

Locus selection

By default, genetic similarities are calculated on all loci with data in both samples of a pair. Genetic similarity calculations can be limited to a subset of loci (across all samples) by passing a tab-delimited text file containing a list of locus IDs to the script via the option --loci locus_list.txt. All other loci in the haplotype call table are ignored. Locus IDs must exactly match the IDs in the haplotype call table. Check out the required input files for examples of locus lists.

Filtering for locus completeness

Consider this simple example of a haplotype call table containing calls from three loci (A, B, and C) in three samples (Sample1, Sample2, and Sample3). Each locus has two alternative haplotypes.

Table 1. Example haplotype call table (empty cells denote missing data).

Reference  Locus  Haplotypes  Sample1  Sample2  Sample3
Chrom1     A      00          0        0        1
Chrom1     A      10          1        1        0
Chrom1     B      000         1
Chrom1     B      100         0
Chrom1     C      00          1                 1
Chrom1     C      11          0                 1

Prior to the genetic similarity calculations, the haplotype call table can be filtered for locus completeness. The locus completeness threshold is the minimum proportion of samples that must have data for at least one haplotype of a locus. Loci with a lower proportion of samples with data are excluded from all downstream calculations. For example, locus A in Table 1 has haplotype calls in three samples (1, 2, and 3), locus B has calls in one sample (1), and locus C has calls in two samples (1 and 3). A locus completeness of 0.3 would keep all loci for downstream calculations, whereas a locus completeness of 0.5 would remove locus B (data in only 1 of 3 samples) from the haplotype call table. The threshold value is preferably chosen as a function of the total number of samples and the overall data completeness in the haplotype call table.
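The locus completeness filter described above can be sketched in a few lines of Python. This is an illustration on the data of Table 1, not SMAP's implementation; the TABLE layout (locus, then haplotype, then per-sample calls with None for missing data) and the function names are assumptions made for the example.

```python
# Table 1: locus -> haplotype -> calls for (Sample1, Sample2, Sample3).
# None marks missing data.
TABLE = {
    "A": {"00": [0, 0, 1], "10": [1, 1, 0]},
    "B": {"000": [1, None, None], "100": [0, None, None]},
    "C": {"00": [1, None, 1], "11": [0, None, 1]},
}
N_SAMPLES = 3


def locus_completeness(table, n_samples):
    """Proportion of samples with data for at least one haplotype, per locus."""
    result = {}
    for locus, haplotypes in table.items():
        with_data = sum(
            any(calls[i] is not None for calls in haplotypes.values())
            for i in range(n_samples)
        )
        result[locus] = with_data / n_samples
    return result


def filter_loci(table, n_samples, threshold):
    """Keep loci whose completeness is at least the threshold."""
    completeness = locus_completeness(table, n_samples)
    return {
        locus: haplotypes
        for locus, haplotypes in table.items()
        if completeness[locus] >= threshold
    }


print(sorted(filter_loci(TABLE, N_SAMPLES, 0.3)))  # ['A', 'B', 'C']
print(sorted(filter_loci(TABLE, N_SAMPLES, 0.5)))  # ['A', 'C']
```

Locus B has data in one of three samples (completeness 0.33), so it passes a threshold of 0.3 but not one of 0.5.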

Filtering for sample completeness

After the selection of loci with a minimum proportion of samples with haplotype calls, a sample completeness filter can be applied with option -sc. The sample completeness is the minimum number of loci with haplotype calls required in each sample. Samples with a lower number of loci with data are discarded. For example, Sample1 in Table 1 has three loci with haplotype calls (A, B, and C), Sample2 has one locus with haplotype calls (A), and Sample3 has two loci with haplotype calls (A and C). Using option -sc 1 would not exclude any samples, but -sc 2 would discard Sample2.
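A minimal Python sketch of the sample completeness filter, again using the data of Table 1, is given here. The data layout and the function names (loci_with_calls, filter_samples) are assumptions for illustration and do not reflect SMAP's internal code.

```python
# Table 1: locus -> haplotype -> calls for (Sample1, Sample2, Sample3).
# None marks missing data.
TABLE = {
    "A": {"00": [0, 0, 1], "10": [1, 1, 0]},
    "B": {"000": [1, None, None], "100": [0, None, None]},
    "C": {"00": [1, None, 1], "11": [0, None, 1]},
}
SAMPLES = ["Sample1", "Sample2", "Sample3"]


def loci_with_calls(table, sample_index):
    """Number of loci for which the sample has data for at least one haplotype."""
    return sum(
        any(calls[sample_index] is not None for calls in haplotypes.values())
        for haplotypes in table.values()
    )


def filter_samples(table, samples, min_loci):
    """Keep samples with at least min_loci loci with haplotype calls."""
    return [
        name
        for index, name in enumerate(samples)
        if loci_with_calls(table, index) >= min_loci
    ]


print(filter_samples(TABLE, SAMPLES, 1))  # ['Sample1', 'Sample2', 'Sample3']
print(filter_samples(TABLE, SAMPLES, 2))  # ['Sample1', 'Sample3']
```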

Plot line curves

To obtain a fair estimate of an appropriate sample completeness, the user can conduct a first run of the script with a sample completeness of zero (default) and create line curves of the genetic similarity as a function of the number of loci using the --plot_line_curves option. These curves show intermediate values of the genetic similarity calculated after every x loci (with x the user-defined locus interval, default = 10). Genetic similarity estimates usually fluctuate in the first part of the line curve, but stabilize once a substantial number of loci is included in the calculations; adding further loci then no longer changes the estimate strongly. Consequently, the number of loci at which the genetic similarity value becomes stable in most sample pairs might be an appropriate value for the sample completeness (-sc). Genetic similarity values of sample pairs with fewer loci may not be reliable, because the estimate may change considerably if more loci were included in the calculation. By default, line curves are created for all sample pairs in the haplotype call table. The order of the sample pairs corresponds to the order in the (renamed/rearranged) haplotype call table (sample_list.txt), and the line curves of all pairs with the same first sample (i.e. the reference sample) are combined into the same plot. However, the user can choose to plot line curves for only a subset of sample pairs. The list of pairs for which line curves must be created (first column = first/reference sample, second column = second sample) can be passed to the script via the --list_line_curves option.
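The idea behind a line curve can be sketched as a running similarity that is recorded after every x loci. The Python example below uses the Jaccard coefficient on hypothetical per-locus counts of shared and unique haplotypes; the function name jaccard_curve, the input format, and the choice to also record the final (possibly incomplete) interval are assumptions, and the order in which SMAP adds loci is not specified here.

```python
def jaccard_curve(per_locus_counts, interval=10):
    """Cumulative Jaccard similarity after every `interval` loci.

    per_locus_counts holds one (a, b, c) tuple per locus of a sample pair:
    a = shared haplotypes, b = unique to the first sample,
    c = unique to the second sample.
    Returns a list of (number_of_loci, similarity) points.
    """
    curve = []
    a = b = c = 0
    n_loci = len(per_locus_counts)
    for n, (la, lb, lc) in enumerate(per_locus_counts, start=1):
        a, b, c = a + la, b + lb, c + lc
        if n % interval == 0 or n == n_loci:
            total = a + b + c
            curve.append((n, a / total if total else float("nan")))
    return curve


# Hypothetical counts for four loci, evaluated every two loci.
counts = [(1, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
for n, similarity in jaccard_curve(counts, interval=2):
    print(n, round(similarity, 3))
```

With real data, the point at which successive values stop changing noticeably suggests a candidate value for the sample completeness.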

Estimate pairwise genetic similarities and distances

Genetic similarity is expressed as the following coefficients:
Jaccard = a / (a + b + c) (Jaccard, 1912)
Sorensen-Dice = 2a / (2a + b + c) (Dice, 1945; Sørensen, 1948)
Ochiai = a / sqrt((a + b) * (a + c)) (Ochiai, 1957)
where a is the number of haplotypes shared between ind1 and ind2, b is the number of haplotypes unique to ind1, and c is the number of haplotypes unique to ind2. The copy number of shared and unique haplotypes is not taken into account in a haplotype table consisting of discrete dosage calls. For example, if the copy number of a haplotype in both members of a sample pair is 2, the number of shared haplotypes is set to 1 (instead of 2). As such, the Jaccard genetic similarity values do not depend on the use of discrete dominant calls or discrete dosage calls.
By default, the genetic similarity between a pair of samples is calculated on all shared loci of that pair, i.e. loci for which both samples have data for at least one haplotype. The haplotype(s) with data do not have to be the same in both samples. Missing data in shared loci is converted into a haplotype call/frequency of zero. Non-shared loci (i.e. loci with missing data in at least one of the two samples) can be included in the calculations by using the --include_non_shared_loci option. However, we recommend including non-shared loci in genetic similarity calculations only if the information content of the set of shared loci is insufficient and the absence of data in a sample likely has a biological explanation (e.g. restriction site polymorphisms in highly saturated GBS data). Locus A in Table 1 is an example of a shared locus in every pairwise comparison, because all three samples have haplotype calls for locus A. Locus B is an example of a non-shared locus in every pairwise comparison, because only Sample1 has haplotype calls for that locus (data for Sample2 and Sample3 are missing). Consequently, no sample pair exists with haplotype calls from locus B in both members of the pair. Locus C is a non-shared locus in pair Sample1 vs Sample2 and pair Sample2 vs Sample3 because Sample2 has no haplotype calls for locus C, but it is a shared locus in pair Sample1 vs Sample3 because both samples in this pair have haplotype calls for that locus.
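As a worked example of the three coefficients, the Python sketch below counts a, b and c over the shared loci of a sample pair in Table 1 and ignores copy number (any call above zero counts as presence). The data layout and function names are assumptions for illustration, not SMAP's own code, and returning NaN for a zero denominator is a choice made for this sketch only.

```python
from math import sqrt

# Table 1: locus -> haplotype -> calls for (Sample1, Sample2, Sample3).
# None marks missing data.
TABLE = {
    "A": {"00": [0, 0, 1], "10": [1, 1, 0]},
    "B": {"000": [1, None, None], "100": [0, None, None]},
    "C": {"00": [1, None, 1], "11": [0, None, 1]},
}


def has_data(haplotypes, i):
    """True if sample i has a call for at least one haplotype of the locus."""
    return any(calls[i] is not None for calls in haplotypes.values())


def is_present(call):
    """Presence/absence only; missing data counts as absent."""
    return call is not None and call > 0


def count_abc(table, i, j):
    """Count shared (a) and unique (b, c) haplotypes over shared loci."""
    a = b = c = 0
    for haplotypes in table.values():
        if not (has_data(haplotypes, i) and has_data(haplotypes, j)):
            continue  # non-shared locus, skipped by default
        for calls in haplotypes.values():
            in_i, in_j = is_present(calls[i]), is_present(calls[j])
            a += int(in_i and in_j)
            b += int(in_i and not in_j)
            c += int(in_j and not in_i)
    return a, b, c


def jaccard(a, b, c):
    return a / (a + b + c) if a + b + c else float("nan")


def sorensen_dice(a, b, c):
    return 2 * a / (2 * a + b + c) if a + b + c else float("nan")


def ochiai(a, b, c):
    denominator = sqrt((a + b) * (a + c))
    return a / denominator if denominator else float("nan")


# Sample1 (index 0) vs Sample3 (index 2): shared loci are A and C.
a, b, c = count_abc(TABLE, 0, 2)
print(a, b, c)                           # 1 1 2
print(jaccard(a, b, c))                  # 0.25
print(sorensen_dice(a, b, c))            # 0.4
print(round(ochiai(a, b, c), 3))         # 0.408
```

For Sample1 vs Sample3, locus A contributes one haplotype unique to each sample, and locus C contributes one shared haplotype (00) and one haplotype unique to Sample3 (11), giving a = 1, b = 1 and c = 2.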

Pairwise similarity estimates can be converted into genetic distances using the inversed or the Euclidean transformation method:
Inversed = 1 - similarity
Euclidean = sqrt((1 - similarity)^2)

Calculate the locus information content

In addition to the pairwise genetic similarity, the number of informative loci can also be calculated for each pair of samples. The user can choose the criterion that is used to define the information content of a locus via the option --locus_information_criterion:

shared: a locus with only shared haplotypes in a pair of samples (default, e.g. locus A for Sample1 vs Sample2 in Table 1)
unique: a locus with only unique haplotypes in a pair of samples (e.g. locus A for Sample1 vs Sample3 in Table 1)

The definition of a locus with shared/unique haplotypes can be broadened by using the --partial option so that all loci with at least one shared/unique haplotype, respectively, are taken into account. In Table 1 for instance, locus C is a partially informative locus in the pair Sample1 vs Sample3 because these samples have one shared haplotype (i.e. haplotype 00) and one haplotype unique to Sample3 (i.e. haplotype 11) on locus C. The script also calculates the number of sample pairs for which a locus is informative (contains discriminatory haplotypes). These numbers can be useful if subsets of loci must be created for downstream data analyses (like diagnostic markers to identify individuals). If an informative locus is defined based on unique haplotypes, then the names of the samples with only unique haplotypes in all sample pairs (--partial = FALSE) or with a unique combination of haplotypes (--partial = TRUE) are also identified for each locus.
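The shared/unique criteria and the effect of the partial setting can be illustrated with a short Python sketch on the data of Table 1. The function names and the data layout are hypothetical, only loci with data in both samples are considered, and the sketch is one reading of the definitions above rather than SMAP's implementation.

```python
# Table 1: locus -> haplotype -> calls for (Sample1, Sample2, Sample3).
# None marks missing data.
TABLE = {
    "A": {"00": [0, 0, 1], "10": [1, 1, 0]},
    "B": {"000": [1, None, None], "100": [0, None, None]},
    "C": {"00": [1, None, 1], "11": [0, None, 1]},
}


def has_data(haplotypes, i):
    """True if sample i has a call for at least one haplotype of the locus."""
    return any(calls[i] is not None for calls in haplotypes.values())


def present(haplotypes, i):
    """Set of haplotypes observed (call above zero) in sample i."""
    return {
        name
        for name, calls in haplotypes.items()
        if calls[i] is not None and calls[i] > 0
    }


def is_informative(haplotypes, i, j, criterion="shared", partial=False):
    """Apply the 'shared' or 'unique' criterion to one locus of a sample pair."""
    in_i, in_j = present(haplotypes, i), present(haplotypes, j)
    shared, unique = in_i & in_j, in_i ^ in_j
    wanted, other = (shared, unique) if criterion == "shared" else (unique, shared)
    if partial:
        return bool(wanted)              # at least one such haplotype
    return bool(wanted) and not other    # only such haplotypes


def count_informative(table, i, j, criterion="shared", partial=False):
    """Number of informative loci among the shared loci of a sample pair."""
    return sum(
        is_informative(haplotypes, i, j, criterion, partial)
        for haplotypes in table.values()
        if has_data(haplotypes, i) and has_data(haplotypes, j)
    )


print(count_informative(TABLE, 0, 1, "shared"))                # 1 (locus A)
print(count_informative(TABLE, 0, 2, "unique"))                # 1 (locus A)
print(count_informative(TABLE, 0, 2, "shared", partial=True))  # 1 (locus C)
print(count_informative(TABLE, 0, 2, "unique", partial=True))  # 2 (loci A and C)
```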

References

Dice, L. R. (1945). Measures of the Amount of Ecological Association between Species. Ecology, 26(3), 297-302. https://doi.org/10.2307/1932409
Huson, D. H. & Bryant, D. (2006). Application of Phylogenetic Networks in Evolutionary Studies. Molecular Biology and Evolution, 23(2), 254-267. https://doi.org/10.1093/molbev/msj030
Jaccard, P. (1912). The distribution of the flora in the Alpine zone. New Phytologist, 11(2), 37-50. https://doi.org/10.1111/j.1469-8137.1912.tb05611.x
Ochiai, A. (1957). Zoogeographical Studies on the Soleoid Fishes Found in Japan and Its Neighbouring Regions - II. Bulletin of the Japanese Society of Scientific Fisheries, 22(9), 526-530. https://doi.org/10.15080/agcjchikyukagaku.58.4_256
Paradis, E. & Schliep, K. (2018). ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics, 35(3), 526-528. https://doi.org/10.1093/bioinformatics/bty633
R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/
Sørensen, T. (1948). A Method of Establishing Groups of Equal Amplitude in Plant Sociology based on Similarity of Species Content and its Application to Analyses of the Vegetation on Danish Commons. Kongelige Danske Videnskabernes Selskab, 5(4), 1–34.