.. raw:: html .. role:: purple .. raw:: html .. role:: white ############# Scope & Usage ############# Scope ----- | **SMAP grm** converts a haplotype call table created by **SMAP haplotype-sites** or **SMAP haplotype-window** into pairwise genetic similarity/distance and/or locus information matrices. | **SMAP grm** works on any multi-allelic haplotype call table obtained from GBS, HiPlex or Shotgun sequencing data using the SMAP package. | **SMAP grm** outputs genetic relationship matrices (grm) as customised, high-quality figures or in standard file formats for downstream data analyses. Integration in the SMAP workflow -------------------------------- .. image:: ../images/grm/SMAP_global_scheme_home_grm.png .. _SMAPgrmExampleInputFiles: Example input files ------------------- .. tabs:: .. tab:: haplotype call table .. csv-table:: :file: ../tables/grm/final_haplotype_discrete_dosage_PS_PC_CS_CU.tsv :delim: tab :header-rows: 1 Options may be given in any order. .. tab:: sample list | A list of sample names can be provided with the option ``--samples`` to select a subset of samples, to change the sample order or names, or to combine data of multiple samples into a single sample (e.g. joining replicates into a single representative sample). | | select a subset of samples: .. csv-table:: :file: ../tables/grm/sample_names_individual_subset.txt :delim: tab :header-rows: 0 | change sample order and names: .. csv-table:: :file: ../tables/grm/sample_names_individual_names_order.txt :delim: tab :header-rows: 0 | group samples: .. csv-table:: :file: ../tables/grm/sample_names_group.txt :delim: tab :header-rows: 0 .. tab:: locus list | A list of locus names can be provided with the option ``--loci`` to analyze only a subset of loci. .. 
csv-table:: :file: ../tables/grm/locus_list.txt :delim: tab :header-rows: 0 Commands & options ------------------ :purple:`Mandatory options for SMAP grm` ``-t``, ``--table`` :white:`###########` *(str)* :white:`###` Name of the haplotype call table obtained with SMAP haplotype-sites or SMAP haplotype-window in the input directory [no default]. Basic command to run **SMAP grm** with default parameters:: smap grm --table /PATH/TO/Haplotype_table.tsv :purple:`Command line options` See tabs below for detailed command line options. .. tabs:: .. tab:: General options | ``-i``, ``--input_directory`` :white:`####` *(str)* :white:`###` Path to the directory containing the haplotype call table, the ``--samples`` text file, and/or the ``--loci`` text file [current directory]. | ``-n``, ``--samples`` :white:`##########` *(str)* :white:`###` Name of a tab-delimited text file in the input directory defining the order of the (new) sample IDs in the matrix: first column = old IDs, second column (optional) = new IDs [no list provided, the order of sample IDs in the grm equals their order in the haplotype call table]. | ``-l``, ``--loci`` :white:`############` *(str)* :white:`###` Name of a tab-delimited text file in the input directory containing a one-column list of locus IDs formatted as in the haplotype call table [no list provided]. | ``-p``, ``--processes`` :white:`#########` *(int)* :white:`###` Number of parallel processes [4]. | ``-t``, ``--plot_type`` :white:`###############` Use this option to choose plot format, choices are png and pdf [png]. | ``-h``, ``--help`` :white:`###################` Show the full list of options. Disregards all other parameters. | ``-v``, ``--version`` :white:`#################` Show the version. Disregards all other parameters. Options may be given in any order. .. tab:: Analysis options | ``-lc``, ``--locus_completeness`` :white:`####` *(float)* :white:`###..` Minimum proportion of samples with haplotype data in a locus. 
Loci with less data are removed [all loci are included]. | ``-sc``, ``--sample_completeness`` :white:`#####` *(int)* :white:`###..` Minimum number of loci with haplotype data in a sample. Samples with less data are removed [all samples are included]. | ``--include_non_shared_loci`` :white:`##############` Loci with data in only one out of two samples in each comparison are included in genetic similarity and locus information calculations [only loci with data in both samples of each comparison are included in calculations]. | ``-s``, ``--similarity_coefficient`` :white:`####` *(str)* :white:`###.` Coefficient used to express pairwise genetic similarity between samples [Jaccard, other options are: Sorensen-Dice and Ochiai]. | ``--distance`` :white:`########################..` Convert genetic similarity estimates into genetic distances [no conversion to distances]. | ``--distance_method`` :white:`#############` *(str)* :white:`###.` Method used for genetic distance calculations [Inversed, other option is: Euclidean]. | ``-lic``, ``--locus_information_criterion`` :white:`#######` Create a matrix showing the number of loci with shared or unique haplotypes in each comparison. Shared: matrix showing the number of loci with shared haplotypes. Unique: matrix showing the number of loci with unique haplotypes [Shared]. | ``--partial`` :white:`##########################` Include loci with at least one shared/unique haplotype in the locus information matrix [only loci with exclusively shared/unique haplotypes are included]. | ``--proportion_informative_loci`` :white:`############` Express the informative locus count as a proportion [locus information content is expressed in absolute numbers]. | ``-b``, ``--bootstrap`` :white:`###############` *(int)* :white:`###` The number of bootstrap replicates of the genetic similarity/distance matrix [no bootstrap replicates]. Options may be given in any order. .. 
tab:: Output options | ``-o``, ``--output_directory`` :white:`####` *(str)* :white:`###` Output directory [current directory]. | ``-s``, ``--suffix`` :white:`####` *(str)* :white:`###` Suffix added to all output file names [no suffix added]. | ``--print_sample_information`` :white:`####` *(str)* :white:`###` Print the similarity/distance matrix and the number of (shared) loci to the output directory as matrices (Matrix option, .csv file) and/or as plots (Plot option, file type specified by --plot_format option) {Matrix, Plot, All} [Matrix]. | ``--print_locus_information`` :white:`####` *(str)* :white:`###` Print locus information to the output directory as a matrix (Matrix option, .csv file) and/or as a plot (Plot option, file type specified by --plot_format option) showing the number of loci with shared or unique haplotypes in each comparison. The locus information per locus can also be printed as a tab-delimited list (List option, .txt file) showing the number of comparisons wherein the locus was informative and the sample IDs with unique haplotypes in all comparisons for each locus {None, Matrix, Plot, List, All} [locus information is not printed]. | ``--matrix_format`` :white:`####` *(str)* :white:`###` Format of the similarity/distance matrix {Phylip, Nexus} [Phylip]. | ``--plot_format`` :white:`####` *(str)* :white:`###` File format of plots {pdf, png, svg, jpg, jpeg, tif, tiff} [pdf]. | ``--mask`` :white:`####` *(str)* :white:`###` Mask values on the main diagonal of each matrix and above (Upper) or below (Lower) the main diagonal {None, Upper, Lower} [no masking]. | ``--annotate_matrix_plots`` :white:`###############` Annotate the matrix plots with values [no annotations]. | ``--no_matrix_plot_labels`` :white:`###############` Do not plot labels on the axes of matrix plots [labels are plotted on the axes of matrix plots]. 
| ``--plot_line_curves`` :white:`#################` Plot line curves showing the genetic correspondence between two samples as a function of the cumulative number of loci [curves are not plotted]. | ``--list_line_curves`` :white:`##################` Tab-delimited text file listing the sample comparisons for which line curves need to be plotted: first column = ID of the first sample, second column = ID of the second sample [no list specified, all comparisons are plotted if the ``--plot_line_curves`` option is specified]. | ``--locus_interval`` :white:`######` *(int)* :white:`###` Interval in the number of loci between two consecutive points in the line curves [10]. | ``-f``, ``--font`` :white:`######` *(str)* :white:`###` Font used in all plots {Times New Roman, Arial, Calibri, Consolas, Verdana, Helvetica, Comic Sans MS} [Times_New_Roman]. | ``--title_fontsize`` :white:`####` *(float)* :white:`###` Title font size in points [12]. | ``--label_fontsize`` :white:`####` *(float)* :white:`###` Label font size in points [12]. | ``--tick_fontsize`` :white:`####` *(float)* :white:`###` Tick font size in points [8]. | ``--legend_fontsize`` :white:`####` *(float)* :white:`###` Legend font size in points [10]. | ``--legend_position`` :white:`####` *(float,float)* :white:`###` Pair of coordinates defining the x (first number) and y (second number) position of the legend in the line curve plots [1,1, *i.e.* position the legend in the top right corner of the plot]. | ``-r``, ``--plot_resolution`` :white:`####` *(float)* :white:`###` Plot resolution in dots per inch (dpi) [300]. | ``--colour_map`` :white:`####` *(str)* :white:`###` Colour palette used in the matrix plots. The (Perceptually Uniform) Sequential colormaps that are shown on the site of `MatPlotLib `_ are accepted [bone]. 
| ``{viridis, plasma, inferno, magma, cividis, Greys, Purples, Blues, Greens, Oranges, Reds, YlOrBr, YlOrRd, OrRd, PuRd, RdPu, BuPu, GnBu, PuBu, YlGnBu, PuBuGn, BuGn, YlGn, binary, gist_yarg, gist_gray, gray, bone, pink, spring, summer, autumn, winter, cool, Wistia, hot, afmhot, gist_heat, copper, PiYG, PRGn, BrBG, PuOr, RdGy, RdBu, RdYlBu, RdYlGn, Spectral, coolwarm, bwr, seismic, twilight, twilight_shifted, hsv}`` Options may be given in any order. Example commands ---------------- Usage:: smap grm -t, --table TABLE [-i INPUT_DIRECTORY] [-n SAMPLES] [-l LOCI] [-lc LOCUS_COMPLETENESS] [-sc SAMPLE_COMPLETENESS] [--include_non_shared_loci] [--similarity_coefficient {Jaccard, Sorensen-Dice, Ochiai}] [--distance] [--distance_method {Inversed, Euclidean}] [-lic LOCUS_INFORMATION_CRITERION {Shared, Unique}] [--partial] [--proportion_informative_loci] [-b BOOTSTRAP] [-p PROCESSES] [-o OUTPUT_DIRECTORY] [-s SUFFIX] [--print_sample_information {Matrix, Plot, All}] [--print_locus_information {None, Matrix, Plot, List, All}] [--matrix_format {Phylip, Nexus}] [--plot_format {pdf, png, svg, jpg, jpeg, tif, tiff}] [--mask {None, Upper, Lower}] [--annotate_matrix_plots] [--no_matrix_plot_labels] [--plot_line_curves] [--list_line_curves LIST_LINE_CURVES] [--locus_interval LOCUS_INTERVAL] [-f {Times New Roman, Arial, Calibri, Consolas, Verdana, Helvetica, Comic Sans MS}] [--title_fontsize TITLE_FONTSIZE] [--label_fontsize LABEL_FONTSIZE] [--tick_fontsize TICK_FONTSIZE] [--legend_fontsize LEGEND_FONTSIZE] [--legend_position X, Y] [-r PLOT_RESOLUTION] [--colour_map {viridis, plasma, inferno, magma, cividis, Greys, Purples, Blues, Greens, Oranges, Reds, YlOrBr, YlOrRd, OrRd, PuRd, RdPu, BuPu, GnBu, PuBu, YlGnBu, PuBuGn, BuGn, YlGn, binary, gist_yarg, gist_gray, gray, bone, pink, spring, summer, autumn, winter, cool, Wistia, hot, afmhot, gist_heat, copper, PiYG, PRGn, BrBG, PuOr, RdGy, RdBu, RdYlBu, RdYlGn, Spectral, coolwarm, bwr, seismic, twilight, twilight_shifted, hsv}] Output 
------ All output files are saved to the user-defined output directory (default = current directory). The output directory is created by the script if it does not exist. The option ``--print_sample_information`` creates tab-delimited text files (Matrix, default), or plots heatmaps (Plot), or both (All). .. tabs:: .. tab:: input haplotype call matrix | To illustrate the different kinds of output that can be created, a simulated haplotype call matrix was created that includes various scenarios of shared and unique haplotype calls across a small sample set. | In the next tabs, the **SMAP grm** output is created by comparing 10 individuals at all 12 loci (left-hand panel), and using settings for "complete unique loci": ``--locus_information_criterion Unique`` (which creates a matrix showing the number of loci with unique haplotypes in each comparison, *e.g.* locus5 in ind7 uniquely has haplotype *d*), counting only loci with exclusively unique haplotypes (complete, default). | The other use case scenarios (here shaded) are discussed in detail in the How It Works section. .. image:: ../images/grm/grm_haplotype_call_matrix_individuals.png .. tab:: Jaccard Inversed Distance matrix | By default, the genetic similarity matrix is printed to a tab-delimited text file. | The default format of these tab-delimited text files is phylip, which can be imported directly into R (R core team, 2020) to reconstruct PCoA plots with the *cmdscale* function or a distance-based phylogenetic tree (*e.g.* with the *nj* function in the R package ape (Paradis & Schliep, 2018)). | Alternatively, matrices can be exported in Nexus format, which can for instance be used to reconstruct distance-based phylogenetic networks with the program *SplitsTree* (Huson & Bryant, 2006). | Additionally, the genetic similarity matrix can be printed as a heatmap (pdf, png, svg, jpg, tif). .. tabs:: .. tab:: table .. 
csv-table:: :file: ../tables/grm/Similarities_Jaccard_individual_CU.dist :delim: tab :header-rows: 1 .. tab:: heatmap .. image:: ../images/grm/Jaccard_genetic_similarity_heatmap_annot_individual_CU.png | The font name and size of different elements in the graphs are changed with options ``--font``, ``--title_fontsize``, ``--label_fontsize``, ``--tick_fontsize``, and ``--legend_fontsize``. The legend position and resolution of the plot are adjusted with the options ``--legend_position`` and ``--plot_resolution``. Customize the colour scale in the matrices with the option ``--colour_map``. Use option ``--mask`` to mask one half of the matrix (either all elements above the main diagonal or all elements below the main diagonal). .. tab:: informative loci | SMAP grm can create matrices with the number or proportion of informative loci in every sample pair, printed to tab-delimited text files (phylip or nexus format) and/or as heatmap. | SMAP grm creates a matrix with the number of shared loci (with data) per sample pair. | SMAP grm creates a list of all loci with the information content per locus, optionally listing samples that contain a unique haplotype for a given locus, if using option ``--locus_information_criterion unique``. .. tabs:: .. tab:: matrix of loci with Complete Unique haplotypes .. tabs:: .. tab:: tab-delimited This table (NumberOfCompletelyUniqueLoci.txt) shows the (absolute) number of loci with complete unique haplotypes per sample pair, in a matrix of all pairwise comparisons. .. csv-table:: :file: ../tables/grm/NumberOfCompletelyUniqueLoci_individual_CU.txt :delim: tab :header-rows: 1 | With option ``--proportion_informative_loci``, table (ProportionOfCompletelyUniqueLoci.txt) lists the proportion of the number of shared loci per sample pair (loci with data in both samples). .. csv-table:: :file: ../tables/grm/ProportionOfCompletelyUniqueLoci_individual_CU_prop.txt :delim: tab :header-rows: 1 .. tab:: heatmap .. 
image:: ../images/grm/NumberOfCompletelyUniqueLoci_heatmap_annot_individual_CU.png | With option ``--proportion_informative_loci``, table (ProportionOfCompletelyUniqueLoci.txt) lists the proportion of the number of shared loci per sample pair (loci with data in both samples). .. image:: ../images/grm/NumberOfCompletelyUniqueLoci_heatmap_annot_individual_CU_prop.png | The font name and size of different elements in the graphs are changed with options ``--font``, ``--title_fontsize``, ``--label_fontsize``, ``--tick_fontsize``, and ``--legend_fontsize``. The legend position and resolution of the plot are adjusted with the options ``--legend_position`` and ``--plot_resolution``. Customize the colour scale in the matrices with the option ``--colour_map``. Use option ``--mask`` to mask one half of the matrix (either all elements above the main diagonal or all elements below the main diagonal). .. tab:: nr of shared loci (with data) This table (NumberOfSharedLoci.txt) shows the number of loci with data in both samples per pair. .. csv-table:: :file: ../tables/grm/NumberOfSharedLoci_individual_CU_prop.txt :delim: tab :header-rows: 1 .. tab:: locus list This table (CompletelyUniqueLoci.txt) shows a list of all loci included in the analyses, the total number of sample pairs for which the loci were considered, the number of sample pairs for which the loci were informative, and the proportion of sample pairs for which the loci were informative. If the ``--locus_information_criterion`` is set to ‘unique’, the names of samples with a unique haplotype for the corresponding locus across all sample pairs are also listed in this file (last column). .. csv-table:: :file: ../tables/grm/CompletelyUniqueLoci_individual_CU.txt :delim: tab :header-rows: 1 .. 
tab:: rarefaction curves for locus completeness | Bootstrap replicates of the genetic similarity/distance matrix can be constructed to obtain statistical support for the inferred relationships between accessions, for instance in a distance-based phylogenetic tree. In each bootstrap replicate, pairwise genetic similarities/distances are estimated from a set of loci that is randomly sampled with replacement from the original set of loci. Data are shown for a set of 38 samples, with 60,000-120,000 loci each. .. image:: ../images/grm/rarefaction_loci_triosteum.png
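The resampling idea described above can be illustrated with a minimal Python sketch. It is not the SMAP grm implementation: the function names (``jaccard``, ``bootstrap_similarities``), the dictionary layout of the haplotype calls, and the choice to pool shared and unique haplotypes across loci are all assumptions made for illustration. Loci without data in either sample are skipped, mirroring the default behaviour of counting only loci with data in both samples.

```python
import random


def jaccard(calls_a, calls_b, loci):
    """Jaccard similarity pooled over the given loci (hypothetical helper).

    calls_a / calls_b map a locus ID to the set of haplotypes called in
    that sample. Loci with no data in either sample are skipped.
    """
    shared = unique = 0
    for locus in loci:
        hap_a = calls_a.get(locus, set())
        hap_b = calls_b.get(locus, set())
        if not hap_a or not hap_b:
            continue
        shared += len(hap_a & hap_b)   # haplotypes present in both samples
        unique += len(hap_a ^ hap_b)   # haplotypes present in only one sample
    total = shared + unique
    return shared / total if total else float("nan")


def bootstrap_similarities(calls_a, calls_b, loci, replicates, seed=0):
    """Recompute the similarity on loci resampled with replacement."""
    rng = random.Random(seed)
    results = []
    for _ in range(replicates):
        resampled = rng.choices(loci, k=len(loci))  # sampling with replacement
        results.append(jaccard(calls_a, calls_b, resampled))
    return results


# Toy haplotype calls for two samples at three loci (made-up data).
ind1 = {"locus1": {"a", "b"}, "locus2": {"a"}, "locus3": {"c"}}
ind2 = {"locus1": {"a"}, "locus2": {"a"}, "locus3": {"d"}}
loci = ["locus1", "locus2", "locus3"]

observed = jaccard(ind1, ind2, loci)
replicates = bootstrap_similarities(ind1, ind2, loci, replicates=100)
print(observed, min(replicates), max(replicates))
```

The spread of the replicate values gives a rough impression of how strongly a pairwise estimate depends on the particular set of loci; in practice the ``--bootstrap`` option produces the replicate matrices directly.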