Scope & Usage

Scope

SMAP delineate analyzes read mapping positions and read depth distributions in stacked read alignments

Bioinformatics analyses that identify sequence variants by comparing reads mapped to a common reference require that sufficient reads map to the same reference genome locations across the sample set. However, a range of technical and biological factors affect read mapping positions and read depth. It is therefore important to first check whether read mapping positions and read depth are consistent across the sample set: if no reads are mapped to a given location in a sample, no variants can be identified there for that sample. Here, we address the special case of loci with 'Stacked short reads' obtained with reduced representation libraries such as Genotyping-By-Sequencing (GBS). The SMAP delineate approach does not apply to randomly fragmented (e.g. Shotgun Sequencing) read data.

Input: SMAP delineate only requires sorted and indexed BAM files with aligned reads

Given a set of BAM files with GBS reads, SMAP delineate is a simple application to address the questions:

  1. Where are the reads located?

  2. How many loci with stacked reads are present per sample?

  3. Are mapping positions consistent across sample sets?

  4. Do polymorphisms occur in read mapping start and end positions within Stacked loci?

  5. How to select loci with sufficient read depth and completeness across the sample set for effective downstream variant calling?

Stack delineation captures within-sample and between-sample read mapping variation

SMAP delineate first creates Stacks by identifying sets of reads with identical read mapping start and end positions per sample.
The start and end positions of such Stacks are called Stack Mapping Anchor Points (SMAPs).
SMAP delineate then creates StackClusters by merging Stacks with positional overlap per sample (via bedtools merge), thus capturing SMAP polymorphisms per sample per locus.
SMAP delineate finally creates MergedClusters by merging StackClusters across all samples by positional overlap (via bedtools merge), thus capturing the variation of read mapping distribution across the sample set.
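
The three levels described above can be illustrated with a minimal Python sketch. It is not the SMAP implementation: it works on plain (chrom, start, end) tuples instead of BAM records, ignores strand and all filters, and the helper names merge_overlapping and delineate are hypothetical. The merge step imitates the default behaviour of bedtools merge, which joins overlapping and book-ended intervals.

```python
from collections import Counter


def merge_overlapping(intervals):
    """Merge overlapping or book-ended (chrom, start, end) intervals."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Positional overlap with the previous interval: extend it.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(interval) for interval in merged]


def delineate(samples):
    """samples: {sample_name: [(chrom, start, end), ...]} read mappings.

    Returns (StackClusters per sample, MergedClusters across samples).
    """
    stack_clusters = {}
    for name, reads in samples.items():
        # Stack: reads with identical start and end positions (value = depth).
        stacks = Counter(reads)
        # StackCluster: Stacks with positional overlap within one sample.
        stack_clusters[name] = merge_overlapping(stacks)
    # MergedCluster: StackClusters with positional overlap across all samples.
    pooled = [c for clusters in stack_clusters.values() for c in clusters]
    return stack_clusters, merge_overlapping(pooled)


# Toy data (hypothetical): sample s1 has two Stacks at one locus that differ
# in their end position, i.e. a SMAP polymorphism within the sample.
samples = {
    "s1": [("chr1", 100, 186)] * 3 + [("chr1", 100, 180)] * 2 + [("chr1", 500, 586)],
    "s2": [("chr1", 95, 186)] * 4 + [("chr2", 10, 96)],
}
per_sample, merged_clusters = delineate(samples)
print(per_sample)
print(merged_clusters)
```

In this toy data, the first MergedCluster on chr1 spans the outermost SMAPs of both samples, which is how variation in read mapping positions across the sample set ends up captured in a single locus.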

Output

SMAP delineate provides custom filters and creates a BED file with MergedCluster positions, i.e. high quality loci for downstream analyses (e.g. SNP variant calling).
SMAP delineate lists read mapping position polymorphisms as a unique type of molecular markers for downstream analyses (e.g. haplotype calling).
SMAP delineate plots feature distributions such as length, read depth, and number of SMAPs per Stack, StackCluster, and MergedCluster.
SMAP delineate plots a saturation curve and other graphs showing locus completeness across the sample set.

Integration in the SMAP workflow

../_images/SMAP_global_scheme_home_delineate.png

SMAP delineate is run on BAM files directly after GBS read mapping, and before SMAP compare or SMAP haplotype-sites. SMAP delineate works on GBS data.


Guidelines for read preprocessing and mapping

Read preprocessing requires specific steps for single-digest or double-digest GBS, in combination with single-end reads, paired-end reads or merged reads.
See Read preprocessing for a detailed description of recommended preprocessing and mapping per GBS method.

Commands & options

Mandatory options for SMAP delineate

As SMAP delineate is entirely data-driven, it does not need any prior information about which or how many different enzymes are used for GBS library construction. It is mandatory to specify the directory containing the BAM and BAI alignment files:

  • smap delineate alignments_dir ## Path to the directory containing BAM and BAI alignment files. All BAM files should be in the same directory [no default].
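
Because every BAM file needs an accompanying BAI index in the same directory, it can help to check the directory before starting a run. The sketch below is an illustration only: the function name missing_indexes is hypothetical, and it assumes the index is named either sample.bam.bai (the samtools index default) or sample.bai.

```python
from pathlib import Path


def missing_indexes(alignments_dir):
    """Return the names of BAM files in alignments_dir without a BAI index."""
    directory = Path(alignments_dir)
    missing = []
    for bam in sorted(directory.glob("*.bam")):
        # Assumed index namings: sample.bam.bai or sample.bai.
        candidates = [bam.with_name(bam.name + ".bai"), bam.with_suffix(".bai")]
        if not any(candidate.exists() for candidate in candidates):
            missing.append(bam.name)
    return missing


if __name__ == "__main__":
    # Placeholder path taken from the examples in this section.
    for name in missing_indexes("/path/to/BAM/"):
        print(f"No index found for {name}")
```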

Depending on whether reads are mapped separately or merged before mapping, the user must specify the corresponding --mapping_orientation option (see the section on strandedness for more information):

  • --mapping_orientation stranded ## Use --mapping_orientation stranded for any BAM file that contains separately mapped reads. Note that this may be single-end or non-merged paired-end read data. In this mode, SMAP delineate uses the strand-specific read mapping orientation to delineate Stacks, StackClusters, and MergedClusters. Paired-end information is not used to extend Stacks of paired-end read pairs with internal overlap after read mapping. With --mapping_orientation stranded, only reads that map on the same strand as indicated per locus in the SMAP BED file are considered.

  • --mapping_orientation ignore ## If paired-end reads are available and the insert size of the library is less than twice the read length, we recommend merging these reads before read mapping (e.g. with PEAR) and mapping only the merged reads. In --mapping_orientation ignore mode, such merged reads are combined into a Stack irrespective of strand-specific read mapping orientation, thus reducing redundancy in the number of unique marker loci on the reference genome and maximizing the effective read depth per StackCluster. Use --mapping_orientation ignore to collect all reads per locus independent of the strand on which they are mapped (i.e. ignoring their mapping orientation).
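
The difference between the two modes can be sketched as a change in the key used to group reads into Stacks. This is a simplified illustration rather than SMAP's own code: reads are plain (chrom, start, end, strand) tuples and the function name count_stacks is hypothetical.

```python
from collections import Counter


def count_stacks(reads, mapping_orientation):
    """Group (chrom, start, end, strand) reads into Stacks with their depth."""
    reads = list(reads)
    if mapping_orientation == "stranded":
        # Strand is part of the Stack identity.
        keys = reads
    elif mapping_orientation == "ignore":
        # Strand is dropped, so + and - reads at the same position pool.
        keys = [(chrom, start, end) for chrom, start, end, _strand in reads]
    else:
        raise ValueError("mapping_orientation must be 'stranded' or 'ignore'")
    return Counter(keys)


# Toy data (hypothetical): five reads at one position, on both strands.
reads = [("chr1", 100, 186, "+")] * 3 + [("chr1", 100, 186, "-")] * 2
print(count_stacks(reads, "stranded"))
print(count_stacks(reads, "ignore"))
```

With these toy reads, stranded mode yields two Stacks (depth 3 and depth 2), while ignore mode yields a single Stack with depth 5, which illustrates why ignore mode reduces locus redundancy and raises the effective read depth for merged reads.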

Basic command to run SMAP delineate with default parameters:

smap delineate /path/to/BAM/ --mapping_orientation stranded
or
smap delineate /path/to/BAM/ --mapping_orientation ignore

Schematic overview of filtering options

../_images/SMAP_delineate.png

Command line options

See tabs below for specific filter options for Stacks, StackClusters, and MergedClusters and more detailed examples of command line options. It is mandatory to specify the directory containing the BAM and BAI alignment files, and the type of reads (separate or merged).

General options:

alignments_dir (str) ## Path to the directory containing BAM and BAI alignment files. All BAM files should be in the same directory. Positional argument, should be the first argument after smap delineate [no default].
--mapping_orientation ## Define the read mapping type. --mapping_orientation stranded for single-end reads or for paired-end reads that are mapped separately (without merging forward and reverse reads), --mapping_orientation ignore for paired-end reads that are merged before mapping.
-p, --processes (int) ## Number of parallel processes [1].
--plot ## Select which plots are generated. --plot nothing disables plot generation. --plot summary only generates graphs with information across all samples, while --plot all will also generate per-sample plots [summary].
-t, --plot_type ## Use this option to choose plot format, choices are png and pdf [png].
-n, --name (str) ## Label to describe the sample set, will be added to the last column in the final SMAP BED file and is used by SMAP compare [Sample_Set1].
-u, --undefined_representation ## Value to use for non-existing or masked data [NaN].
-h, --help ## Show the full list of options. Disregards all other parameters.
-v, --version ## Show the version. Disregards all other parameters.
--debug ## Enable verbose logging. Provides additional intermediate output files used for sample-specific QC, including the BED files for Stacks and StackClusters per sample.

General filtering options:

-q, --min_mapping_quality (int) ## Minimum read mapping quality to include a read in the analysis [30].

Options may be given in any order.

Command to run SMAP delineate with specified directory with BAM files, number of parallel processes, graphical output format, label for the sample set, and adjusted Mapping Quality:

smap delineate /path/to/BAM/ --mapping_orientation stranded -p 8 --plot_type png --name 2n_ind_GBS-SE --min_mapping_quality 20
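
When SMAP delineate is launched from a pipeline script, the same command can be assembled as an argument list. The sketch below only builds and prints the command. The helper build_delineate_command is hypothetical and not part of SMAP, while the option names and default values are those listed under General options above.

```python
import shlex


def build_delineate_command(alignments_dir, mapping_orientation, processes=1,
                            plot_type="png", name="Sample_Set1",
                            min_mapping_quality=30):
    """Assemble a `smap delineate` call as an argument list (hypothetical helper)."""
    if mapping_orientation not in ("stranded", "ignore"):
        raise ValueError("mapping_orientation must be 'stranded' or 'ignore'")
    return [
        "smap", "delineate", str(alignments_dir),  # positional argument first
        "--mapping_orientation", mapping_orientation,
        "--processes", str(processes),
        "--plot_type", plot_type,
        "--name", name,
        "--min_mapping_quality", str(min_mapping_quality),
    ]


command = build_delineate_command(
    "/path/to/BAM/", "stranded",
    processes=8, name="2n_ind_GBS-SE", min_mapping_quality=20,
)
# Print a shell-quoted version for logging; the list itself could be passed
# to subprocess.run(command, check=True) on a system where smap is installed.
print(shlex.join(command))
```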

Example commands

Typical command to run SMAP delineate for separately mapped single-end GBS reads in diploid individuals.

smap delineate /path/to/BAM/ --mapping_orientation stranded -p 8 --plot all --plot_type png --name 2n_ind_GBS-SE -f 50 -g 200 --min_stack_depth 3 --max_stack_depth 500 --min_cluster_depth 10 --max_stack_number 2 --min_stack_depth_fraction 10 --completeness 1 --max_smap_number 10

Output

By default, five plots are created to summarize locus features across the sample set: a locus saturation curve as a function of the total number of reads mapped per sample, a graph of locus completeness across the sample set, a graph of the read mapping polymorphisms (number of SMAPs) per locus, a graph of locus lengths across the sample set, and a graph of the median read lengths per locus across the sample set.
Optionally, separate graphs of locus features can be plotted per sample; these are strongly recommended for Quality Control of each new sample set and for troubleshooting. A graphical summary can be generated for each sample for the two incremental levels of read merging (Stacks and StackClusters), showing for example the distribution of read depth, length, and number of Stacks per locus.
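
To make the notion of locus completeness concrete, the sketch below computes, for each locus, the percentage of samples in which it was observed. This definition is an assumption made for illustration and may differ from how SMAP delineate computes its completeness graph; the function name locus_completeness and the toy data are hypothetical.

```python
from collections import Counter


def locus_completeness(loci_per_sample):
    """loci_per_sample: {sample_name: set of locus identifiers observed}.

    Returns {locus: percentage of samples in which the locus was observed}.
    """
    n_samples = len(loci_per_sample)
    observed = Counter(
        locus for loci in loci_per_sample.values() for locus in set(loci)
    )
    return {locus: 100 * n / n_samples for locus, n in observed.items()}


# Toy data (hypothetical): four samples, three loci.
loci_per_sample = {
    "s1": {"locusA", "locusB", "locusC"},
    "s2": {"locusA", "locusB"},
    "s3": {"locusA"},
    "s4": {"locusA", "locusB"},
}
for locus, pct in sorted(locus_completeness(loci_per_sample).items()):
    print(f"{locus}\t{pct:.0f}%")
```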

An extensive collection of examples and explanations for different types of GBS libraries can be found in the section Example data analyses.
A sneak preview of the most important summary graphical output:
../_images/Graphical_summary.png