.. raw:: html

.. role:: purple

.. raw:: html

.. role:: white

###################
Feature Description
###################

.. _SMAPgrmdef:

Setting the stage
-----------------

**SMAP grm**'s key function is to analyze pairwise genetic similarity, or conversely, the loci that discriminate between a set of samples based on a genotype table.

:purple:`Haplotype call table`

**SMAP grm** specifically works on the haplotype call table created by **SMAP haplotype-sites** or **SMAP haplotype-window**. Haplotype calls in the table can be discrete (*e.g.* 0, 1 or 2) for individuals, or haplotype frequencies (value from 0 to 1) for pool-Seq data. This haplotype call table lists the absence and presence of haplotypes as multi-allelic markers for a given set of loci, across a given set of samples. The haplotype call table is the only input required to run the script.

:purple:`Genetic distance`

Genetic distance may be calculated in different ways, but is essentially based on the proportion of shared haplotypes over all observed haplotypes. A pair of samples that contain only the same haplotypes (across all loci) has a distance of 0 and a similarity of 1. A pair of samples in which each sample contains only haplotypes that the other sample does not contain (across all loci) has a distance of 1 and a similarity of 0. Genetic distance can only be calculated over loci for which both samples have data; otherwise their haplotypes cannot be compared.

:purple:`The difference between "Shared loci" and "Shared or Unique haplotypes"`

It is important to understand the clear distinction between these definitions:

:purple:`Shared loci:` Loci that have a call of at least one haplotype in each of two samples in a pair are called 'shared loci' (loci with data in a sample pair). Loci with missing data are excluded. Given a pair of samples with a shared locus, haplotypes can be compared.
:purple:`Unique haplotype:` a haplotype that occurs in one sample but not in the other, is defined as a "Unique haplotype". :purple:`Shared haplotype:` a haplotype that occurs in both samples of a pair, is defined as a "Shared haplotype". Note that Unique and Shared haplotypes can only be evaluated in shared loci (with data in both samples). Then, there are two ways in which similarities and/or discriminatory loci are calculated, based either on Shared or Unique haplotypes directly (like Jaccard Distance), or by aggregating on "loci that contain Shared or Unique haplotypes". To clarify the concepts and definitions, check out the scheme and explanation below. ========== ====== =========== ======== ======== ======== Reference Locus Haplotypes Sample1 Sample2 Sample3 ========== ====== =========== ======== ======== ======== Chrom1 A 00 0 0 1 Chrom1 A 10 1 1 0 Chrom1 B 000 1 Chrom1 B 100 0 Chrom1 C 00 1 1 Chrom1 C 11 0 1 ========== ====== =========== ======== ======== ======== Shared loci: - locusA: is a Shared locus in all pairs of samples. - locusB: is never a Shared locus, data is missing in two out of three samples, no pairs with data - locusC: is only a Shared locus in the pair Sample1-Sample3. Unique haplotypes: - locusA: in pair Sample1-Sample2, there are no Unique haplotypes. in pair Sample1-Sample3, there are two Unique haplotypes: haplotype 00 is Unique for Sample3, haplotype 10 is Unique for Sample1. The same score applies to pair Sample2-Sample3. - locusB: haplotypes can not be evaluated, as it is not a Shared locus in any of the sample pairs. - locusC: haplotype 11 is Unique for Sample3 (but in heterozygous state with the Shared haplotype 00 in Sample3). Shared haplotypes: - locusA: in pair Sample1-Sample2, there are two Shared haplotypes (00 and 10). in pair Sample1-Sample3, there are no Shared haplotypes. in pair Sample2-Sample3, there are no Shared haplotypes. 
- locusB: haplotypes cannot be evaluated, as it is not a Shared locus in any of the sample pairs.
- locusC: haplotype 00 is shared between Sample1 and Sample3.

Jaccard Distance
----------------

.. image:: ../images/grm/calculate_genetic_distance_jaccard.png

The Jaccard calculation counts the number of Shared haplotypes per pair and divides that by the sum of (Unique haplotypes for the first sample + Unique haplotypes for the second sample + Shared haplotypes), counted across all shared loci of the pair.

For example in pair Sample1-Sample3:

- Shared haplotypes: locusA (0) + locusC (1) = 1 (a)
- Unique haplotypes Sample1: locusA (1) + locusC (0) = 1 (b)
- Unique haplotypes Sample3: locusA (1) + locusC (1) = 2 (c)
- Jaccard = a/(a+b+c) = 1/(1+1+2) = 1/4 = 0.25

This ratio equals 1 when two samples contain only the same haplotypes, so it expresses similarity; the corresponding distance is 1 - 0.25 = 0.75.

Information content of loci with Shared or Unique haplotypes
------------------------------------------------------------

**SMAP grm** can also evaluate (dis)similarities at the level of the locus, which has great value for:

- delineating loci with unique haplotypes per sample (diagnostic identification markers, *e.g.* cultivar detection)
- identifying pairs of individuals with (near-)completely identical haplotype sets (*e.g.* clones)
- identifying pairs of individuals with (near-)complete sets of loci with at least one haplotype shared per locus (*e.g.* parent-progeny pairs)

To fit these purposes, SMAP grm uses two optional parameters to define how loci are scored, based on the number of Unique and Shared haplotypes per locus in a sample pair: ``--locus_information_criterion`` and ``--partial``.

.. image:: ../images/grm/calculate_genetic_distance_shared_unique.png

In the scheme above, note how the two optional parameters ``--locus_information_criterion`` (indicated in red) and ``--partial`` (indicated in blue) control which loci are counted, depending on whether they have Shared or Unique haplotypes, and whether *all* haplotypes per locus are Unique or Shared (complete), or only a part of the haplotypes are Unique or Shared (partial).
- counts across all loci are indicated in green
- counts at the level of haplotype (Unique / Shared haplotypes) are indicated in red
- whether loci contain only Shared or Unique haplotypes (complete), or only part of the haplotype set per locus is Shared or Unique (partial), is indicated in blue

So, combining ``--locus_information_criterion`` and ``--partial`` creates four scenarios for counting loci based on their constituent haplotypes:

- :purple:`Complete Unique:` all haplotypes observed per locus are Unique. This means that if one of those haplotypes is observed, it can uniquely be assigned to that sample (of the pair). All haplotypes of this locus contain discriminatory information to distinguish the samples. For instance, in parent-progeny testing, a locus set that only contains loci with Complete Unique haplotypes can always be used to confirm an F1 progeny plant, as the F1 individual will always receive one Unique haplotype (allele) from one parent and one other Unique haplotype (allele) from the other parent. Designing a molecular marker set would benefit from enriching for loci with high discriminative power, and running **SMAP grm** in Complete Unique mode helps to identify such loci.
- :purple:`Partial Unique:` at least one haplotype observed per locus is Unique for the sample pair. This means that if the Unique haplotype is observed, that haplotype can be assigned to the 'donor' sample. However, some of the haplotypes are also Shared, and if those are detected, the haplotype cannot be assigned to a specific sample of the pair, limiting discriminatory power.
- :purple:`Complete Shared:` all haplotypes observed per locus are Shared. This means that the genotypes in the sample pair are genetically identical (at least at the loci screened).
  For instance, running **SMAP grm** on a sample set that may contain clonal material (*i.e.* with many loci expected to contain identical haplotypes in pairs of samples) will identify clones as pairs in which near-100% of the loci are counted as Complete Shared.
- :purple:`Partial Shared:` at least one of the haplotypes observed per locus is Shared between the sample pair. When many (if not all) loci show Partially Shared haplotypes, the two samples may be directly related, perhaps siblings or parent-progeny.

A parental pair with many Complete Unique loci will pass on parent-specific Unique haplotypes (alleles) to its progeny. This also means that on all loci observed in the progeny, one may expect to find at least one Unique haplotype from parent1 and also one Unique haplotype from parent2. In other words, in each parent-progeny pair that locus is expected to show Partially Shared haplotypes (only one allele is shared each time with one of the parents; the other cannot match because all alleles are parent-specific).

The expected pattern for a triplet is therefore two parents that have many loci with Completely Unique haplotypes and very few loci with Partially Shared haplotypes between them, and a progeny that has Partially Shared haplotypes with *both* parents at near-100% of loci. So, parental testing is implemented by first running **SMAP grm** on genetic diversity data of parental lines and defining loci with Complete Unique haplotypes, and then running **SMAP grm** on the genotyping matrix of parents and progeny, looking for loci with Partially Shared haplotypes to identify parent-progeny pairs.

Loci with Complete Unique haplotypes
------------------------------------

To illustrate the different kinds of analyses that can be performed, a simulated haplotype call matrix was created that includes various scenarios of shared and unique haplotype calls across a small sample set.

.. tabs:: .. 
tab:: input haplotype call matrix | This scheme summarizes the different steps of **SMAP grm**, applied to a set of 10 individuals, and calculating the Jaccard Distance and loci with Complete Unique haplotypes. | The **SMAP grm** output is created by comparing 10 individuals at all 12 loci (left hand panel), and using settings for loci with "Complete Unique haplotypes": ``--locus_information_criterion unique`` (which creates a matrix showing the number of loci with unique haplotypes in each comparison *e.g.* locus5 in ind7 uniquely has haplotype *d*) and only loci with *all* unique haplotypes are counted (complete, default) .. image:: ../images/grm/CU_individuals_distances.png | Loci with Complete Unique haplotypes are indicated in red. .. tab:: informative loci | SMAP grm can create matrices with the number or proportion of informative loci in every sample pair, printed to tab-delimited text files (phylip or nexus format) and/or as heatmap. | SMAP grm creates a matrix with the number of shared loci (with data) per sample pair. | SMAP grm creates a list of all loci with the information content per locus, optionally listing samples that contain a unique haplotype for a given locus, if using option ``--locus_information_criterion unique``. .. tabs:: .. tab:: matrix of loci with Complete Unique haplotypes .. tabs:: .. tab:: tab-delimited This table (NumberOfCompletelyUniqueLoci.txt) shows the (absolute) number of loci with Complete Unique haplotypes per sample pair, in a matrix of all pairwise comparisons. .. csv-table:: :file: ../tables/grm/NumberOfCompletelyUniqueLoci_individual_CU.txt :delim: tab :header-rows: 1 | With option ``--proportion_informative_loci``, table (ProportionOfCompletelyUniqueLoci.txt) lists the proportion of the number of shared loci per sample pair (loci with data in both samples). .. csv-table:: :file: ../tables/grm/ProportionOfCompletelyUniqueLoci_individual_CU_prop.txt :delim: tab :header-rows: 1 .. tab:: heatmap .. 
image:: ../images/grm/NumberOfCompletelyUniqueLoci_heatmap_annot_individual_CU.png | With option ``--proportion_informative_loci``, the heatmap shows the proportion of the number of shared loci per sample pair (loci with data in both samples). .. image:: ../images/grm/NumberOfCompletelyUniqueLoci_heatmap_annot_individual_CU_prop.png | The font name and size of different elements in the graphs are changed with options ``--font``, ``--title_fontsize``, ``--label_fontsize``, ``--tick_fontsize``, and ``--legend_fontsize``. The legend position and resolution of the plot are adjusted with the options ``--legend_position`` and ``--plot_resolution``. Customize the colour scale in the matrices with the option ``--colour_map``. Use option ``--mask`` to mask one half of the matrix (either all elements above the main diagonal or all elements below the main diagonal). .. tab:: nr of shared loci (with data) This table (NumberOfSharedLoci.txt) shows the number of loci with data in both samples per pair. .. csv-table:: :file: ../tables/grm/NumberOfSharedLoci_individual_CU_prop.txt :delim: tab :header-rows: 1 .. tab:: locus list This table (CompletelyUniqueLoci.txt) shows a list of all loci included in the analyses, the total number of sample pairs for which the loci were considered, the number of sample pairs for which the loci were informative, and the proportion of sample pairs for which the loci were informative. If the ``--locus_information_criterion`` is set to ‘unique’, the names of samples with a unique haplotype for the corresponding locus across all sample pairs are also listed in this file (last column). .. 
csv-table:: :file: ../tables/grm/CompletelyUniqueLoci_individual_CU.txt :delim: tab :header-rows: 1 Loci with Partial Unique haplotypes ----------------------------------- To illustrate the different kinds of analyses that can be performed, a simulated haplotype call matrix was created that includes various scenarios of shared and unique haplotype calls across a small sample set. .. tabs:: .. tab:: input haplotype call matrix | This scheme summarizes the different steps of **SMAP grm**, applied to a set of 10 individuals, and calculating the Jaccard Distance and loci with Partial Unique haplotypes. | The **SMAP grm** output is created by comparing 10 individuals at all 12 loci (left hand panel), and using settings for loci with "Partial Unique haplotypes": ``--locus_information_criterion unique`` (which creates a matrix showing the number of loci with unique haplotypes in each comparison *e.g.* locus5 in ind7 uniquely has haplotype *d*) and loci that contain at least one unique haplotype are counted (``--partial``). .. image:: ../images/grm/PU_individuals_distances.png | Loci with Partial Unique haplotypes are indicated in purple. .. tab:: informative loci | SMAP grm can create matrices with the number or proportion of informative loci in every sample pair, printed to tab-delimited text files (phylip or nexus format) and/or as heatmap. | SMAP grm creates a matrix with the number of shared loci (with data) per sample pair. | SMAP grm creates a list of all loci with the information content per locus, optionally listing samples that contain a unique haplotype for a given locus, if using option ``--locus_information_criterion unique``. .. tabs:: .. tab:: matrix of loci with Partial Unique haplotypes .. tabs:: .. tab:: tab-delimited This table (NumberOfPartiallyUniqueLoci.txt) shows the (absolute) number of loci with Partial Unique haplotypes per sample pair, in a matrix of all pairwise comparisons. .. 
csv-table:: :file: ../tables/grm/NumberOfPartiallyUniqueLoci_individual_PU.txt :delim: tab :header-rows: 1 | With option ``--proportion_informative_loci``, table (ProportionOfPartiallyUniqueLoci.txt) lists the proportion of the number of shared loci per sample pair (loci with data in both samples). .. csv-table:: :file: ../tables/grm/ProportionOfPartiallyUniqueLoci_individual_PU_prop.txt :delim: tab :header-rows: 1 .. tab:: heatmap .. image:: ../images/grm/NumberOfPartiallyUniqueLoci_heatmap_annot_individual_PU.png | With option ``--proportion_informative_loci``, the heatmap shows the proportion of the number of shared loci per sample pair (loci with data in both samples). .. image:: ../images/grm/NumberOfPartiallyUniqueLoci_heatmap_annot_individual_PU_prop.png | The font name and size of different elements in the graphs are changed with options ``--font``, ``--title_fontsize``, ``--label_fontsize``, ``--tick_fontsize``, and ``--legend_fontsize``. The legend position and resolution of the plot are adjusted with the options ``--legend_position`` and ``--plot_resolution``. Customize the colour scale in the matrices with the option ``--colour_map``. Use option ``--mask`` to mask one half of the matrix (either all elements above the main diagonal or all elements below the main diagonal). .. tab:: nr of shared loci (with data) This table (NumberOfSharedLoci.txt) shows the number of loci with data in both samples per pair. .. csv-table:: :file: ../tables/grm/NumberOfSharedLoci_individual_PU.txt :delim: tab :header-rows: 1 .. tab:: locus list This table (PartiallyUniqueLoci.txt) shows a list of all loci included in the analyses, the total number of sample pairs for which the loci were considered, the number of sample pairs for which the loci were informative (contain Partial Unique haplotypes), and the proportion of sample pairs for which the loci were informative. 
If the ``--locus_information_criterion`` is set to ‘unique’, the names of samples with at least one unique haplotype for the corresponding locus across all sample pairs are also listed in this file (last column). .. csv-table:: :file: ../tables/grm/PartiallyUniqueLoci_individual_PU.txt :delim: tab :header-rows: 1 Loci with Complete Shared haplotypes ------------------------------------ To illustrate the different kinds of analyses that can be performed, a simulated haplotype call matrix was created that includes various scenarios of shared and unique haplotype calls across a small sample set. .. tabs:: .. tab:: input haplotype call matrix | This scheme summarizes the different steps of **SMAP grm**, applied to a set of 10 individuals, and calculating the Jaccard Distance and loci with Complete Shared haplotypes. | The **SMAP grm** output is created by comparing 10 individuals at all 12 loci (left hand panel), and using settings for loci with "Complete Shared haplotypes": ``--locus_information_criterion shared`` (which creates a matrix showing the number of loci with shared haplotypes in each comparison *e.g.* locusX in indY and indZ both have haplotype *a* and *b*) and only loci with all shared haplotypes are counted (complete, default) .. image:: ../images/grm/CS_individuals_distances.png | Loci with Complete Shared haplotypes are indicated in orange. .. tab:: informative loci | SMAP grm can create matrices with the number or proportion of informative loci in every sample pair, printed to tab-delimited text files (phylip or nexus format) and/or as heatmap. | SMAP grm creates a matrix with the number of shared loci (with data) per sample pair. .. tabs:: .. tab:: matrix of loci with Complete Shared haplotypes .. tabs:: .. tab:: tab-delimited This table (NumberOfCompletelySharedLoci.txt) shows the (absolute) number of loci with Complete Shared haplotypes per sample pair, in a matrix of all pairwise comparisons. .. 
csv-table:: :file: ../tables/grm/NumberOfCompletelySharedLoci_individual_CS.txt :delim: tab :header-rows: 1 | With option ``--proportion_informative_loci``, table (ProportionOfCompletelySharedLoci.txt) lists the proportion of the number of shared loci per sample pair (loci with data in both samples). .. csv-table:: :file: ../tables/grm/ProportionOfCompletelySharedLoci_individual_CS_prop.txt :delim: tab :header-rows: 1 .. tab:: heatmap .. image:: ../images/grm/NumberOfCompletelySharedLoci_heatmap_annot_individual_CS.png | With option ``--proportion_informative_loci``, the heatmap shows the proportion of the number of shared loci per sample pair (loci with data in both samples). .. image:: ../images/grm/NumberOfCompletelySharedLoci_heatmap_annot_individual_CS_prop.png | The font name and size of different elements in the graphs are changed with options ``--font``, ``--title_fontsize``, ``--label_fontsize``, ``--tick_fontsize``, and ``--legend_fontsize``. The legend position and resolution of the plot are adjusted with the options ``--legend_position`` and ``--plot_resolution``. Customize the colour scale in the matrices with the option ``--colour_map``. Use option ``--mask`` to mask one half of the matrix (either all elements above the main diagonal or all elements below the main diagonal). .. tab:: nr of shared loci (with data) This table (NumberOfSharedLoci.txt) shows the number of loci with data in both samples per pair. .. csv-table:: :file: ../tables/grm/NumberOfSharedLoci_individual_CS.txt :delim: tab :header-rows: 1 .. tab:: locus list This table (CompletelySharedLoci.txt) shows a list of all loci included in the analyses, the total number of sample pairs for which the loci were considered, the number of sample pairs for which the loci were informative (contain Complete Shared haplotypes), and the proportion of sample pairs for which the loci were informative. .. 
csv-table:: :file: ../tables/grm/CompletelySharedLoci_individual_CS.txt :delim: tab :header-rows: 1 Loci with Partial Shared haplotypes ----------------------------------- To illustrate the different kinds of analyses that can be performed, a simulated haplotype call matrix was created that includes various scenarios of shared and unique haplotype calls across a small sample set. .. tabs:: .. tab:: input haplotype call matrix | This scheme summarizes the different steps of **SMAP grm**, applied to a set of 10 individuals, and calculating the Jaccard Distance and loci with Partial Shared haplotypes. | The **SMAP grm** output is created by comparing 10 individuals at all 12 loci (left hand panel), and using settings for loci with "Partial Shared haplotypes": ``--locus_information_criterion shared`` (which creates a matrix showing the number of loci with Shared haplotypes in each comparison *e.g.* locusX in indY and indZ both have haplotype *a*) and loci that share at least one haplotype are counted (``--partial``). .. image:: ../images/grm/PS_individuals_distances.png | Loci with Partial Shared haplotypes are indicated in green. .. tab:: informative loci | SMAP grm can create matrices with the number or proportion of informative loci in every sample pair, printed to tab-delimited text files (phylip or nexus format) and/or as heatmap. | SMAP grm creates a matrix with the number of shared loci (with data) per sample pair. | SMAP grm creates a list of all loci with the information content per locus, optionally listing samples that contain a unique haplotype for a given locus, if using option ``--locus_information_criterion unique``. .. tabs:: .. tab:: matrix of loci with Partial Shared haplotypes .. tabs:: .. tab:: tab-delimited This table (NumberOfPartiallySharedLoci.txt) shows the (absolute) number of loci with Partial Shared haplotypes per sample pair, in a matrix of all pairwise comparisons. .. 
csv-table:: :file: ../tables/grm/NumberOfPartiallySharedLoci_individual_PS.txt :delim: tab :header-rows: 1 | With option ``--proportion_informative_loci``, table (ProportionOfPartiallySharedLoci.txt) lists the proportion of the number of shared loci per sample pair (loci with data in both samples). .. csv-table:: :file: ../tables/grm/ProportionOfPartiallySharedLoci_individual_PS_prop.txt :delim: tab :header-rows: 1 .. tab:: heatmap .. image:: ../images/grm/NumberOfPartiallySharedLoci_heatmap_annot_individual_PS.png | With option ``--proportion_informative_loci``, the heatmap shows the proportion of the number of shared loci per sample pair (loci with data in both samples). .. image:: ../images/grm/NumberOfPartiallySharedLoci_heatmap_annot_individual_PS_prop.png | The font name and size of different elements in the graphs are changed with options ``--font``, ``--title_fontsize``, ``--label_fontsize``, ``--tick_fontsize``, and ``--legend_fontsize``. The legend position and resolution of the plot are adjusted with the options ``--legend_position`` and ``--plot_resolution``. Customize the colour scale in the matrices with the option ``--colour_map``. Use option ``--mask`` to mask one half of the matrix (either all elements above the main diagonal or all elements below the main diagonal). .. tab:: nr of shared loci (with data) This table (NumberOfSharedLoci.txt) shows the number of loci with data in both samples per pair. .. csv-table:: :file: ../tables/grm/NumberOfSharedLoci_individual_PS.txt :delim: tab :header-rows: 1 .. tab:: locus list This table (PartiallySharedLoci.txt) shows a list of all loci included in the analyses, the total number of sample pairs for which the loci were considered, the number of sample pairs for which the loci were informative (contain Partial Shared haplotypes), and the proportion of sample pairs for which the loci were informative. .. csv-table:: :file: ../tables/grm/PartiallySharedLoci_individual_PS.txt :delim: tab :header-rows: 1
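The Jaccard example for pair Sample1-Sample3 and the four locus scoring scenarios (Complete/Partial combined with Unique/Shared) can be sketched in a few lines of Python. This is only an illustration of the definitions on this page, not the **SMAP grm** implementation: the ``calls`` layout (one set of observed haplotypes per sample and locus, with missing loci simply absent) and all function names are assumptions made for this sketch. Locus B is left out because it is never a Shared locus, and "partial" is assumed to mean "at least one", so complete loci also satisfy the partial criterion.

```python
import math

# Hypothetical layout: calls[sample][locus] = set of haplotypes with a call > 0.
# A locus that is absent from a sample's dict is treated as missing data.
calls = {
    "Sample1": {"A": {"10"}, "C": {"00"}},
    "Sample2": {"A": {"10"}},
    "Sample3": {"A": {"00"}, "C": {"00", "11"}},
}


def shared_loci(a, b):
    """Loci with data in both samples of the pair."""
    return sorted(a.keys() & b.keys())


def haplotype_counts(a, b):
    """Return (shared, unique to a, unique to b), summed over shared loci."""
    shared = unique_a = unique_b = 0
    for locus in shared_loci(a, b):
        shared += len(a[locus] & b[locus])
        unique_a += len(a[locus] - b[locus])
        unique_b += len(b[locus] - a[locus])
    return shared, unique_a, unique_b


def jaccard_similarity(a, b):
    """a / (a + b + c) as in the worked example; NaN without shared loci."""
    shared, unique_a, unique_b = haplotype_counts(a, b)
    total = shared + unique_a + unique_b
    return shared / total if total else math.nan


def classify_locus(haps_a, haps_b):
    """Score one shared locus under the four scenarios."""
    shared = haps_a & haps_b
    unique = haps_a ^ haps_b  # haplotypes seen in exactly one sample
    return {
        "complete_unique": bool(unique) and not shared,
        "partial_unique": bool(unique),
        "complete_shared": bool(shared) and not unique,
        "partial_shared": bool(shared),
    }


s1, s3 = calls["Sample1"], calls["Sample3"]
similarity = jaccard_similarity(s1, s3)  # 1 / (1 + 1 + 2) = 0.25
distance = 1 - similarity                # 0.75
labels = {locus: classify_locus(s1[locus], s3[locus]) for locus in shared_loci(s1, s3)}
```

In this sketch, locus A of pair Sample1-Sample3 scores as Complete Unique (and therefore also Partial Unique), while locus C scores as both Partial Unique and Partial Shared, matching the locusA and locusC descriptions in the scheme at the top of this page.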