Feature Description

Setting the stage

SMAP grm’s key function is to analyze pairwise genetic similarity, or conversely, the loci that discriminate between a set of samples based on a genotype table.

Haplotype call table

SMAP grm specifically works on the haplotype call table created by SMAP haplotype-sites or SMAP haplotype-window. Haplotype calls in the table can be discrete (e.g. 0, 1 or 2) for individuals, or haplotype frequencies (value from 0 to 1) for pool-Seq data. This haplotype call table lists the absence and presence of haplotypes as multi-allelic markers for a given set of loci, across a given set of samples. The haplotype call table is the only input required to run the script.

Genetic distance

Genetic distance may be calculated in different ways, but is essentially based on the proportion of shared haplotypes over all observed haplotypes. A pair of samples that contain exactly the same haplotypes across all loci has a distance of 0 and a similarity of 1. A pair of samples in which every haplotype occurs in only one of the two samples, across all loci, has a distance of 1 and a similarity of 0. Genetic distance can only be calculated at loci for which both samples have data; otherwise their haplotypes cannot be compared.

The difference between “Shared loci” and “Shared or Unique haplotypes”

It is important to understand the clear distinction between these definitions:

Shared loci: Loci that have a call of at least one haplotype in each of two samples in a pair are called ‘shared loci’ (loci with data in a sample pair). Loci with missing data are excluded. Given a pair of samples with a shared locus, haplotypes can be compared.

Unique haplotype: a haplotype that occurs in one sample of a pair but not in the other is defined as a “Unique haplotype”.

Shared haplotype: a haplotype that occurs in both samples of a pair is defined as a “Shared haplotype”.

Note that Unique and Shared haplotypes can only be evaluated in shared loci (with data in both samples). Then, there are two ways in which similarities and/or discriminatory loci are calculated, based either on Shared or Unique haplotypes directly (like Jaccard Distance), or by aggregating on “loci that contain Shared or Unique haplotypes”.

To clarify the concepts and definitions, check out the scheme and explanation below.

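Reference  Locus  Haplotypes  Sample1  Sample2  Sample3
Chrom1     A      00          0        0        1
Chrom1     A      10          1        1        0
Chrom1     B      000         1 (single sample with data)
Chrom1     B      100         0 (single sample with data)
Chrom1     C      00          1        -        1
Chrom1     C      11          0        -        1

A dash (-) marks missing data. At locus B, only one of the three samples has data.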

Shared loci:

  • locusA: is a Shared locus in all pairs of samples.

  • locusB: is never a Shared locus, because data is missing in two out of three samples, so no pair has data in both samples.

  • locusC: is only a Shared locus in the pair Sample1-Sample3.

Unique haplotypes:

  • locusA: in pair Sample1-Sample2, there are no Unique haplotypes. In pair Sample1-Sample3, there are two Unique haplotypes: haplotype 00 is Unique for Sample3 and haplotype 10 is Unique for Sample1. The same score applies to pair Sample2-Sample3.

  • locusB: haplotypes cannot be evaluated, as it is not a Shared locus in any of the sample pairs.

  • locusC: in pair Sample1-Sample3, haplotype 11 is Unique for Sample3 (where it occurs in heterozygous state together with the Shared haplotype 00).

Shared haplotypes:

  • locusA: in pair Sample1-Sample2, there is one Shared haplotype (10); haplotype 00 is absent from both samples and therefore does not occur in either of them. In pair Sample1-Sample3, there are no Shared haplotypes. In pair Sample2-Sample3, there are no Shared haplotypes.

  • locusB: haplotypes cannot be evaluated, as it is not a Shared locus in any of the sample pairs.

  • locusC: haplotype 00 is shared between Sample1 and Sample3.
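The definitions above can be traced in a few lines of Python. This is only a minimal sketch of the concepts, not SMAP grm's own code: the `CALLS` dictionary, the `present` and `compare` helpers, and the use of `None` for missing data are all hypothetical choices made for illustration. A haplotype is counted as Shared only when it is present in both samples, as the definition states. The text only says that two of the three samples lack data at locusB, so placing its calls in Sample1 is an assumption.

```python
from itertools import combinations

# Toy haplotype calls from the example table: locus -> haplotype -> per-sample call.
# None marks missing data. ASSUMPTION: the locus B calls are placed in Sample1;
# the text only states that two of the three samples have no data at this locus.
CALLS = {
    "A": {"00": {"Sample1": 0, "Sample2": 0, "Sample3": 1},
          "10": {"Sample1": 1, "Sample2": 1, "Sample3": 0}},
    "B": {"000": {"Sample1": 1, "Sample2": None, "Sample3": None},
          "100": {"Sample1": 0, "Sample2": None, "Sample3": None}},
    "C": {"00": {"Sample1": 1, "Sample2": None, "Sample3": 1},
          "11": {"Sample1": 0, "Sample2": None, "Sample3": 1}},
}


def present(locus, sample):
    """Haplotypes called present in `sample` at `locus` (empty set if no data)."""
    return {hap for hap, calls in CALLS[locus].items() if calls[sample]}


def compare(locus, s1, s2):
    """Shared and Unique haplotypes for one pair, or None if the locus is not shared."""
    h1, h2 = present(locus, s1), present(locus, s2)
    if not h1 or not h2:
        return None  # at least one sample has no haplotype call: not a Shared locus
    return {"shared": h1 & h2, "unique_1": h1 - h2, "unique_2": h2 - h1}


for s1, s2 in combinations(["Sample1", "Sample2", "Sample3"], 2):
    for locus in CALLS:
        print(s1, s2, locus, compare(locus, s1, s2))
```

Set intersection and set difference map directly onto "Shared" and "Unique", which is why sets of present haplotypes are used here rather than the raw 0/1 calls.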

Jaccard Distance

../_images/calculate_genetic_distance_jaccard.png

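The Jaccard Distance counts the number of Shared haplotypes per pair, summed over the shared loci of that pair, and divides that by the sum of (Unique haplotypes for the first sample + Unique haplotypes for the second sample + Shared haplotypes). By the definitions given under Genetic distance, this ratio equals 1 for a pair with identical haplotype sets, so it expresses similarity; the corresponding distance is 1 minus the ratio.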

For example in pair Sample1-Sample3:

  • Shared Haplotypes: locusA (0) + locusC (1) = 1 (a)

  • Unique Haplotypes Sample1: locusA (1) + locusC (0) = 1 (b)

  • Unique Haplotypes Sample3: locusA (1) + locusC (1) = 2 (c)

  • Jaccard = a/(a+b+c) = 1/(1+1+2) = 1/4 = 0.25
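The arithmetic of this example can be reproduced with a short Python sketch. It is an illustration of the ratio a/(a+b+c) only, not the SMAP grm implementation; the function name `jaccard` and the representation of each sample as a dictionary of haplotype sets per locus are assumptions made here.

```python
def jaccard(sample_a, sample_b):
    """Return a / (a + b + c) over the shared loci of a pair, or None if there are none.

    Each sample is a dict mapping locus -> set of haplotypes present;
    loci without data are simply left out of the dict (hypothetical layout).
    """
    shared = unique_a = unique_b = 0
    for locus in sample_a.keys() & sample_b.keys():
        h_a, h_b = sample_a[locus], sample_b[locus]
        if not h_a or not h_b:
            continue  # no haplotype call in one of the samples: not a shared locus
        shared += len(h_a & h_b)    # a: haplotypes present in both samples
        unique_a += len(h_a - h_b)  # b: haplotypes present only in the first sample
        unique_b += len(h_b - h_a)  # c: haplotypes present only in the second sample
    total = shared + unique_a + unique_b
    return shared / total if total else None


# Sample1 and Sample3 from the example table (locusB is not shared and is left out).
sample1 = {"A": {"10"}, "C": {"00"}}
sample3 = {"A": {"00"}, "C": {"00", "11"}}

ratio = jaccard(sample1, sample3)
print(ratio)      # 0.25, matching 1/(1+1+2)
print(1 - ratio)  # 0.75, the complement of the ratio
```

Returning `None` when a pair has no shared loci keeps "no comparable data" distinct from a ratio of 0.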

Information content of loci with Shared or Unique haplotypes

SMAP grm can also evaluate (dis)similarities at the level of the locus, which has great value for:

  • delineating loci with unique haplotypes per sample (diagnostic identification markers, e.g. cultivar detection)

  • identifying pairs of individuals with (near-)completely identical haplotype sets (e.g. clones)

  • identifying pairs of individuals with (near-)complete sets of loci with at least one haplotype shared per locus (e.g. parent-progeny pairs)

To fit these purposes, SMAP grm uses two optional parameters to define how loci are scored, based on the number of Unique and Shared haplotypes per locus in a sample pair: --locus_information_criterion and --partial.

../_images/calculate_genetic_distance_shared_unique.png

In the scheme above, note how the two optional parameters --locus_information_criterion (indicated in red) and --partial (indicated in blue) control which loci are counted, depending on whether they have Shared or Unique haplotypes, and on whether all haplotypes per locus are Unique or Shared (complete) or only part of the haplotypes are Unique or Shared (partial).

  • counts across all loci are indicated in green

  • counts at the level of haplotype (Unique / Shared haplotypes) are indicated in red

  • whether all haplotypes of a locus are Shared or Unique (complete), or only part of the haplotype set per locus is Shared or Unique (partial), is indicated in blue

So, combining --locus_information_criterion and --partial creates four scenarios for counting loci based on their constituent haplotypes:

  • Complete Unique: all haplotypes observed per locus are Unique. This means that if one of those haplotypes is observed, it can be assigned uniquely to one sample of the pair. All haplotypes of this locus contain discriminatory information to distinguish the samples. For instance, in parent-progeny testing, a locus set that only contains loci with Complete Unique haplotypes can always be used to confirm an F1 progeny plant, as the F1 individual will always receive one Unique haplotype (allele) from one parent and another Unique haplotype (allele) from the other parent. Designing a molecular marker set would benefit from enriching for loci with high discriminative power, and running SMAP grm in Complete Unique mode helps to identify such loci.

  • Partial Unique: at least one haplotype observed per locus is Unique for the sample pair. This means that if the Unique haplotype is observed, that haplotype can be assigned to the ‘donor’ sample. However, some of the haplotypes are also Shared, and if those are detected, the haplotype cannot be assigned to a specific sample of the pair, which limits discriminatory power.

  • Complete Shared: all haplotypes observed per locus are Shared. This means that the genotypes in the sample pair are genetically identical (at least at the loci screened). For instance, running SMAP grm on a sample set that may contain clonal material (i.e. with many loci expected to contain identical haplotypes in pairs of samples) will identify clones as pairs in which near-100% of the loci are counted as Complete Shared.

  • Partial Shared: at least one of the haplotypes observed per locus is Shared between the samples of the pair. When many (if not all) loci show Partial Shared haplotypes, the two samples may be directly related, for instance as siblings or as parent and progeny. A parental pair with many Complete Unique loci passes parent-specific Unique haplotypes (alleles) on to its progeny, so at each such locus the progeny is expected to carry one Unique haplotype from parent1 and one Unique haplotype from parent2. In each parent-progeny comparison, that locus is therefore expected to show Partial Shared haplotypes: one allele is shared with the parent being compared, and the other cannot match because all alleles are parent-specific. The expected pattern is a triplet in which the two parents have many loci with Complete Unique haplotypes and very few loci with Partial Shared haplotypes between them, while the progeny has near-100% of loci with Partial Shared haplotypes with both parents. So, parental testing is implemented in two steps: first run SMAP grm on genetic diversity data of the parental lines to define loci with Complete Unique haplotypes, then run SMAP grm on the genotyping matrix of parents and progeny and look for loci with Partial Shared haplotypes to identify parent-progeny pairs.
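The four scenarios can be expressed as a small classification per locus. The Python sketch below is an illustration under stated assumptions, not SMAP grm's internal logic: `classify_locus` and `count_loci` are hypothetical helpers, their `criterion` and `partial` arguments merely mirror the two command-line options, and the partial scenarios are assumed to include the complete ones (a locus where all haplotypes are Unique also has "at least one" Unique haplotype).

```python
def classify_locus(h_a, h_b):
    """Return the scenario labels that apply to one shared locus of a sample pair.

    h_a and h_b are the non-empty sets of haplotypes present in each sample.
    """
    shared = h_a & h_b   # haplotypes present in both samples
    unique = h_a ^ h_b   # haplotypes present in exactly one sample
    labels = set()
    if unique:
        labels.add("partial_unique")
        if not shared:
            labels.add("complete_unique")
    if shared:
        labels.add("partial_shared")
        if not unique:
            labels.add("complete_shared")
    return labels


def count_loci(sample_a, sample_b, criterion="unique", partial=False):
    """Count shared loci of a pair that match the chosen scenario (hypothetical helper)."""
    label = ("partial_" if partial else "complete_") + criterion
    count = 0
    for locus in sample_a.keys() & sample_b.keys():
        h_a, h_b = sample_a[locus], sample_b[locus]
        if h_a and h_b and label in classify_locus(h_a, h_b):
            count += 1
    return count


# Sample1 and Sample3 from the example table, as locus -> haplotypes present.
sample1 = {"A": {"10"}, "C": {"00"}}
sample3 = {"A": {"00"}, "C": {"00", "11"}}

for criterion in ("unique", "shared"):
    for partial in (False, True):
        print(criterion, partial, count_loci(sample1, sample3, criterion, partial))
```

For this pair, locusA is Complete Unique, while locusC is both Partial Unique and Partial Shared, so the four counts are 1, 2, 0 and 1 in the order printed.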

Loci with Complete Unique haplotypes

To illustrate the different kinds of analyses that can be performed, a simulated haplotype call matrix was created that includes various scenarios of shared and unique haplotype calls across a small sample set.

This scheme summarizes the different steps of SMAP grm, applied to a set of 10 individuals, and calculating the Jaccard Distance and loci with Complete Unique haplotypes.
The SMAP grm output is created by comparing 10 individuals at all 12 loci (left hand panel), using the settings for loci with “Complete Unique haplotypes”: --locus_information_criterion unique (which creates a matrix showing the number of loci with unique haplotypes in each comparison, e.g. locus5 in ind7 uniquely has haplotype d), with only loci in which all haplotypes are unique being counted (complete, default).
../_images/CU_individuals_distances.png
Loci with Complete Unique haplotypes are indicated in red.
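How such a pairwise matrix of locus counts comes about can be sketched in Python. The data of the 10 simulated individuals are not listed in the text, so this sketch falls back on the three samples of the earlier example table (locusB is left out, since it never forms a shared locus). The function `complete_unique_matrix`, the nested-dictionary matrix layout and the zero diagonal are assumptions for illustration, not the format of the SMAP grm output.

```python
from itertools import combinations

# Haplotypes present per sample and locus, taken from the three-sample example table.
SAMPLES = {
    "Sample1": {"A": {"10"}, "C": {"00"}},
    "Sample2": {"A": {"10"}},
    "Sample3": {"A": {"00"}, "C": {"00", "11"}},
}


def complete_unique_matrix(samples):
    """Symmetric matrix of the number of loci with Complete Unique haplotypes per pair."""
    names = sorted(samples)
    # ASSUMPTION: the diagonal is set to 0 (a sample has no Unique haplotypes versus itself).
    matrix = {a: {b: 0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        count = 0
        for locus in samples[a].keys() & samples[b].keys():
            h_a, h_b = samples[a][locus], samples[b][locus]
            # Complete Unique: both samples have calls and no haplotype is Shared.
            if h_a and h_b and not (h_a & h_b):
                count += 1
        matrix[a][b] = matrix[b][a] = count
    return matrix


matrix = complete_unique_matrix(SAMPLES)
for name, row in matrix.items():
    print(name, row)
```

Only locusA separates Sample3 from the other two samples completely, so the pairs Sample1-Sample3 and Sample2-Sample3 each score 1, and Sample1-Sample2 scores 0.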

Loci with Partial Unique haplotypes

This scheme summarizes the different steps of SMAP grm, applied to a set of 10 individuals, and calculating the Jaccard Distance and loci with Partial Unique haplotypes.
The SMAP grm output is created by comparing 10 individuals at all 12 loci (left hand panel), using the settings for loci with “Partial Unique haplotypes”: --locus_information_criterion unique (which creates a matrix showing the number of loci with unique haplotypes in each comparison, e.g. locus5 in ind7 uniquely has haplotype d), with loci that contain at least one unique haplotype being counted (--partial).
../_images/PU_individuals_distances.png
Loci with Partial Unique haplotypes are indicated in purple.

Loci with Complete Shared haplotypes

This scheme summarizes the different steps of SMAP grm, applied to a set of 10 individuals, and calculating the Jaccard Distance and loci with Complete Shared haplotypes.
The SMAP grm output is created by comparing 10 individuals at all 12 loci (left hand panel), using the settings for loci with “Complete Shared haplotypes”: --locus_information_criterion shared (which creates a matrix showing the number of loci with shared haplotypes in each comparison, e.g. at locusX, indY and indZ both have haplotypes a and b), with only loci in which all haplotypes are shared being counted (complete, default).
../_images/CS_individuals_distances.png
Loci with Complete Shared haplotypes are indicated in orange.

Loci with Partial Shared haplotypes

This scheme summarizes the different steps of SMAP grm, applied to a set of 10 individuals, and calculating the Jaccard Distance and loci with Partial Shared haplotypes.
The SMAP grm output is created by comparing 10 individuals at all 12 loci (left hand panel), using the settings for loci with “Partial Shared haplotypes”: --locus_information_criterion shared (which creates a matrix showing the number of loci with Shared haplotypes in each comparison, e.g. at locusX, indY and indZ both have haplotype a), with loci that share at least one haplotype being counted (--partial).
../_images/PS_individuals_distances.png
Loci with Partial Shared haplotypes are indicated in green.