Feature Description

The SMAP snp-seq module converts a reference sequence and target SNPs into template sequences and designs primers for each template sequence with Primer3 (Untergasser et al., 2012). Here, we describe the different features used by SMAP snp-seq.

This overview show the relationship between the different features and how input files and parameter settings are used to customize the workflow.

Reference

SMAP snp-seq requires at least a reference sequence (in FASTA format, mandatory --reference option).

SMAP snp-seq can also take a specific VCF file as input to change the reference sequence with the --customize_reference option.

This is a VCF file with SNP positions that differ between the reference sequence and all samples of the sample collection used to detect the SNPs. These positions are not polymorphic within the sample set used for SNP detection, but they differ between every sample and the reference sequence. To design primers that match the sample set genomes, the reference sequence needs to be adjusted by substituting specific positions with the alternative nucleotide. To adjust the reference sequence at appropriate positions, the user can provide a VCF file with all called differences between the reference sequence and the sample set to SMAP snp-seq using the --custom_reference_vcf option. These nucleotides will not be treated as target SNPs for primer design, as they are not polymorphic among samples. Instead, the reference sequence will be adjusted in-place by substituting the reference sequence nucleotides with the alternative nucleotides listed in the VCF file.

command to adjust the reference sequence at predefined SNP positions:

python3 SMAP_snp-seq.py --reference genome_reference.fa --customize_reference reference.vcf

The reference VCF file contains the postion of nucleotides in the reference sequence (orange) that are replaced by alternative nucleotides in the template sequence (green) prior to primer design.

Template regions

The template regions define the START and END coordinates of the sequence retrieved from the reference sequence and passed to Primer3 for amplicon design. This template sequence will further be modified by specifying one or more target regions (to be included in the amplicons, defined by the target_SNP.vcf file) and encoding background SNPs as ‘N’ to be avoided at primer binding sites (defined by the background_SNP.vcf file). Only SNPs provided with the --target_vcf option and overlapping with these regions are targeted for amplicon design, and primers flanking the target regions can only be designed within these template regions. By default, only one amplicon will be retained per template region, even if mutiple target regions are provided per template region.

based on predefined regions (BED)

The --template_region option is used to provide a BED file with predefined coordinates of template regions for primer design.

For example, a BED file can be created with:

evenly spaced marker loci (a 1 kb template region spaced at every 1 Mbp)

a set of genes obtained from a stuctural gene annotation file (e.g. the mRNA coordinates of 500 candidate genes)

a set of GBS loci obtained with SMAP delineate

Command to define target regions by predefined regions:

python3 SMAP_snp-seq.py --reference reference.fa --target_vcf TARGETS.vcf --template_region CandidateMarkerLoci.bed

The START coordinate of each target region in the BED file must be smaller than the END coordinate. A user can choose to extend all regions in the BED file at both ends with a given number of nucleotides specified by the --region_extension option (dark grey). SNPs in the target_VCF file that lie outside the template regions defined in the BED file are ignored for primer design (light grey).

default: remove overlap between target regions

Figure 7. Primer design using GBS data and a BED file with region extension (lower) compared to primer design using GBS data and a BED file without region extension (upper). The template sequences are extended at the 5’ and 3’ end of the genomic coordinates in the BED file (orange lines) to the new region ends (green lines).

Extensions may cause an overlap between neighbouring regions, creating template sequences with the same genomic fragments. This may result in artificial mispriming (see below), inhibiting primer design for these templates. To avoid such artificial mispriming, the script checks by default (also if regions are not extended) for overlapping regions. Regions that partially overlap with another region located more upstream in the contig are trimmed to the most upstream nucleotide that does not overlap with another region. If regions fully overlap, the shortest region is discarded (Figure 8). If preferred, overlapping sequence parts can be retained using the --retain_overlap option.

option --retain_overlap

Figure 8. Overview of the different scenario’s of region overlap before (upper) and after (lower) the exclusion of overlap. Partially overlapping regions are trimmed to the first non-overlapping nucleotide in the 5’-end of the most downstream region, while the shortest region of completely overlapping regions is removed.

based on target SNPs (VCF)

Template regions can also be defined based on the position and relative distance of SNPs in the vcf file provided with the mandatory --target_vcf option.

By default, SMAP snp-seq requires a single VCF file provided with the --target_vcf option, containing SNP positions (purple, first column: reference sequence ID, second column: genomic location of SNP). Based on the SNP coordinates listed in that vcf file, SNPs are grouped into template regions (bold) that, combined with their flanking regions (here, max 30 bp, italics), are retrieved from the reference sequence as template for primer design.

Variant distance

../_images/SMAP_snp-seq_overview_feature_SNPs.png

../_images/SMAP_snp-seq_overview_feature_region.png

Figure 4. Primer design using GBS data with a BED file containing genomic coordinates (lower) compared to primer design using GBS data without a BED file (upper). Template sequences are not allowed to exceed the genomic coordinates in the BED file (shown by the orange lines).

Command to define target regions based on SNPs only:

python3 SMAP_snp-seq.py --reference reference.fa --vcf VCF.vcf

Alternatively, a specific list of target SNPs (blue) can be provided with the additional --target_vcf option. Based on those coordinates, SNPs are grouped into target regions (bold) that, combined with their flanking regions (here, max 30 bp, italics), are retrieved from the reference sequence as template for primer design. In this case, the list of SNPs provided with the mandatory --vcf option (purple) is only used as ‘background’ polymorphisms that should be avoided during primer design (purple, encoded as ‘N’ in the template sequence).

command with specific target SNPs and background SNPs:

python3 SMAP_snp-seq.py --reference reference.fa --vcf VCF.vcf --target_vcf target.vcf

If SNPs were called (present in the VCF file(s) parsed with the --target_vcf option and/or the --background_vcf option) in the added fragments, they will be added to the template sequence.

Figure 4. Primer design using GBS data with a BED file containing genomic coordinates (lower) compared to primer design using GBS data without a BED file (upper). Template sequences are not allowed to exceed the genomic coordinates in the BED file (shown by the orange lines).

based on predefined regions (BED) and target SNPs (VCF)

Split template

If the regions are fairly large (e.g. gene sequences), the user can split these regions into multiple (neighboring, non-overlapping?) templates using the --split_template option. The template sequences will be constructed using the same procedure as stated above without using the --regions option. But only split regions overlapping with the --region BED file will be retained?

Figure 5. Primer design using WGS data and a BED file with the split_template option (lower) compared to primer design using WGS data and a BED file without the split_template option. The split template option splits the template sequence delineated in the BED file (orange lines) into multiple template sequences using the same reasoning as illustrated by figure 2.

Target region

By default, each SNP listed in the vcf file provided as input with the mandatory --vcf option is set as a target for primer design. Starting from a given SNP, any downstream SNPs within a distance specified by the --maximum_target_size option are grouped together as one target region.

command with definition of maximum target size:

python3 SMAP_snp-seq.py --reference reference.fa --vcf VCF.vcf --maximum_target_size 10

The --minimum_target_distance option specifies the distance in the reference sequence that is skipped after the end of a target region untill the next target region. Any SNPs in that region are ignored as targets for primer design. The first SNP after the minimum target distance is the start of the next target region, which again includes all neighboring SNPs within the --maximum_target_size region.

command with definition of maximum target size and minimum target distance:

python3 SMAP_snp-seq.py --reference reference.fa --vcf VCF.vcf --maximum_target_size 10 --minimum_target_distance 20

../_images/maximum_target_size_minimum_target_distance.png

SMAP snp-seq can also take a specific VCF file as input to define target SNPs:

If a VCF file with a user-defined selection of target SNPs is provided with the --target_vcf option, only SNPs in this VCF will be considered as potential targets for primer design. Target regions are defined as described directly above, taking the --maximum_target_size and --minimum_target_distance options into account. Other SNP positions listed in the VCF file provided with the mandatory --vcf option will be incorporated as “N” in the template sequences, but are not set as targets.

offset

Users can also opt to ignore targets that are close to the region borders by setting an offset (Figure 6). SNPs that are positioned within the offset are not considered as targets, but the positions of these variants are still marked in the template sequences to take them into account for primer design. By default, the script does not allow primer3 to design primers with unknown nucleotides (N) so that primer sequences never overlap with known SNP positions in the provided VCF file(s). If users want to design degenerated primers containing unknown nucleotides, the number of unknown nucleotides in primer sequences can be increased using the –maxn option.

Figure 6. Primer design using GBS data and a BED file with offsets (lower) compared to primer design using GBS data and a BED file without offsets (upper). SNPs within offset regions (delineated by the green lines) are incorporated in the template sequences, but not included in target regions.

From Template region to Template sequence

In order for Primer3 to create a pair of primers surrounding target regions, SMAP snp-seq extends the target region with extra upstream and downstream flanking regions, and retrieves the corresponding nucleotide sequence from the reference sequence, i.e. the ‘template sequence’.

Several options determine where the ‘template region’ is located with respect to the target region containing the target SNPs.

Target region with a single SNP If neighboring SNPs are located at a distance greater than that defined by the --variant_distance option, the target region only contains a single SNP.

If a single VCF file with SNP coordinates is provided as input to SMAP snp-seq with the --vcf option, the target region is placed in the center of the template sequence. In other words, nucleotide sequences upstream and downstream of the target region are extracted from the reference genome to create the template sequence.

If the distance between SNPs is larger than , single SNPs are located as the central nucleotide in a template sequence. For example, with the default --variant distance of 500 nucleotides, a template sequence with only one SNP will contain 500 nucleotides. The first 250 nucleotides are equal to the reference sequence upstream of the SNP, followed by the SNP (coded as ‘N’), and ending with another 249 nucleotides equal to the reference sequence downstream of the SNP.

SNPs located within a predefined distance from each other (defined by the --variant_distance option) are assigned to the same template sequence.

If that one SNP is located less than half the --variant_distance from the beginning or end of the reference sequence, the template sequence length becomes shorter than the variant distance.

For instance, if the SNP is located on reference position 201, the template sequence is shortened to 450 nucleotides (instead of 500 nucleotides). The template sequence then consists of the 200 nucleotides upstream of the SNP that are equal to the reference sequence, followed by the SNP (coded as ‘N’), and ending with another 249 nucleotides equal to the reference sequence downstream of the SNP.

If a second SNP is located 10 nucleotides downstream from a first SNP, the template sequence region is extended with 10 nucleotides, resulting in the following template sequence: 200 nucleotides equal to the reference sequence - the first SNP (‘N’) - 9 nucleotides equal to the reference sequence - the second SNP (‘N’) - 249 nucleotides equal to the reference sequence downstream of the second SNP.

So, as long as consecutive SNPs are located within a distance from each other that is equal to half the variant distance or less, they will be included in the same template sequence.

In theory, the size of a template sequence can be as large as the size of its corresponding reference sequence (Figure 2).

If the distance between the final nucleotide of a target and the next SNP is smaller than the value specified in the –target_distance option, that SNP will be skipped and not be included in any target. In other words, targets within the same template will be spaced from each other by a minimum distance equal to the –target_distance. Primer3 will attempt to develop only one primer pair for each template sequence, even if a template has multiple targets (Figure 2).

A template sequence with only one SNP usually has a size equal to the variant distance with the SNP positioned in the center of the template.

../_images/SMAP_snp-seq_overview_features.png

Primer3 amplicon design

amplicons

The minimum and maximum size of an amplicon (including the primer sequences) can be specified. It is recommended to keep the size range of amplicons relatively small (e.g. max 10 or 20 nucleotides difference between the shortest and largest amplicon). A small size range avoids the relative overamplification of shorter amplicons compared to the larger ones in the same multiplex PCR. Regions delineated in the BED file (optionally) provided with the --template_regions option must be at least as large as the minimum amplicon size. Shorter regions are not processed during primer design.

primers

The primer size range can also be adjusted using the options –minimum_primer_size, –optimal_primer_size, and –maximum_primer_size. The specified optimal size of a primer must always be larger than or equal to the minimum primer size and smaller than or equal to the maximum primer size. If not, the script will adjust the optimal primer size to the value that is closest to the defined primer size range (i.e. the minimum or maximum primer size if the optimal primer size was smaller than the minimum or larger as the maximum, respectively). The script also checks for mispriming (i.e. the annealing of primer sequences to (partially) complementary sequences), both within the same template sequence and in the other template sequences. Parts of the reference sequence that are not included in a template sequence are not checked for mispriming.

Output

The primer sequences are output to a tab-delimited text file in table format containing the amplicon ID (contig_ID:start-end), the orientation of the primer (forward or reverse), the genomic position of the first nucleotide in the primer sequence (start), the genomic position of the last nucleotide in the primer sequence (stop), and the nucleotide sequence (5’ to 3’). The start and end coordinates in the amplicon ID correspond to the first nucleotide downstream of the 3’-end of the forward primer and the last nucleotide upstream of the 3’-end of the reverse primer. In addition, the primer coordinates are stored in a BED file (0-based, start coordinate inclusive, stop coordinate exclusive), and a BED and GFF input file is created for SMAP haplotype-sites and SMAP haplotype-window, respectively. The coordinates of the extended amplicons (including primer regions) are also provided in a BED output file. The template sequences (including all modifications to the reference sequence) and the reference sequences of the amplicons (e.g. to perform a nucleotide BLAST) are also provided in FASTA format. All output files are saved in the specified output directory.