Scope & Usage

Scope

SMAP target-selection is run before SMAP design. HiPlex amplicon design optimization starts with choosing the set of target sequences (e.g. candidate genes) for which primers need to be designed. To ensure primer and/or gRNA specificity in SMAP design, genomic sequences with high sequence similarity to the targets should be included in the set of reference sequences. One straightforward approach is to use precomputed gene families, such as those provided by the comparative genomics platform PLAZA. Alternatively, target genes can be grouped by homology group, pathway, InterPro domain, or other shared sequence features (e.g. a domain repository).


Integration in the SMAP workflow

[Figure: SMAP target-selection within the global SMAP scheme (../_images/SMAP_global_scheme_home_target_selection.png)]

SMAP target-selection is run on a reference genome FASTA file, a genome annotation GFF file and a geneID list (optionally grouped) to extract and reorient candidate (gene) sequences before further downstream analysis such as read mapping, SMAP design, SMAP haplotype-sites, SMAP haplotype-window and SMAP effect-prediction. SMAP target-selection is run to create HiPlex designs.

Guidelines for the selection of reference sequences

  • A reference sequence FASTA file should include all target regions (e.g. sets of candidate genes, grouped by gene family or genetic pathway).

  • If sets of candidate genes are used as targets, the reference should include as many paralogous sequences as possible (or any other region with high sequence homology, such as BLAST hits or pseudogenes) to ensure primer specificity and minimal off-target primer binding. Precomputed gene families, such as those retrieved from comparative genomics platforms like PLAZA, are ideal for this.

  • All sequences in the reference sequence FASTA file should encode candidate genes on the positive strand (CDS orientation) to facilitate compatibility with downstream analysis (see Commands & options: --selectGenes).

  • The GFF file should contain at least the genome_region and the CDS features of all target regions. The coordinates of the features should correspond to their respective sequence in the reference sequence FASTA file. The GFF file can contain surplus target regions not present in the reference sequence FASTA file.

  • The reference sequence FASTA and corresponding GFF file can be extracted from a reference genome sequence using SMAP target-selection.
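
The positive-strand requirement in the guidelines above can be illustrated with a minimal sketch. It assumes 1-based, inclusive GFF coordinates and a strand value of "+" or "-". The helper names `reverse_complement` and `extract_gene` are hypothetical and are not taken from SMAP_target-selection.py, whose actual implementation may differ.

```python
# Hypothetical helpers for illustration only; not the SMAP implementation.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_gene(chrom_seq: str, start: int, end: int, strand: str) -> str:
    """Extract a gene using 1-based inclusive GFF coordinates.

    Genes annotated on the minus strand are reverse-complemented so that
    the returned sequence is always in CDS (positive-strand) orientation.
    """
    sub = chrom_seq[start - 1:end]
    return reverse_complement(sub) if strand == "-" else sub


chrom = "AAACCCGGGTTT"
plus_gene = extract_gene(chrom, 4, 9, "+")    # taken as-is
minus_gene = extract_gene(chrom, 1, 6, "-")   # reverse-complemented
```

When a sequence is reoriented in this way, the feature coordinates in the accompanying GFF file have to be recalculated relative to the new sequence, which is why the FASTA and GFF files should be produced together.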

Commands & options

Reference gene sets in GFF and FASTA format can be extracted with the Python script SMAP_target-selection.py, provided in the SMAP utility tools.

It is mandatory to specify the genome GFF and FASTA file of the species, the gene families data file and the species as positional arguments:
gff3 file  (str)  Path to the gff3 file (tab-delimited) of the species containing gene, CDS, and exon features with positions relative to the fasta file [no default].
fasta file  (str)  Path to the FASTA file containing the genomic sequence of the species [no default].
    Example from PLAZA: ath.fasta.gz
gene families data file  (str)  Path to the gene family information file (tab-delimited) for the (coding) genes, separated per gene family type [no default].
    Example from PLAZA: genefamily_data.HOMFAM.csv.gz
species  (str)  Species, corresponding with species indicated in the gene family info file [no default].
    Example: ath

The gene families data file can also be used to group genes by homology group, pathway, InterPro domain, etc. To do so, list the group_id in the first column of the file and the species and gene_id in the second and third columns, respectively, and pass the list of group_ids of interest with the option -f, --hom_groups:

#group_id    species    gene_id
PATHWAY1     ath        AT5G03800
PATHWAY1     ath        AT5G52850
PATHWAY1     ath        AT5G06540
PATHWAY1     ath        AT1G10330
PATHWAY2     ath        AT3G05240
PATHWAY2     ath        AT3G28640
PATHWAY2     ath        AT4G04370
PATHWAY2     ath        AT5G42450
PATHWAY2     ath        AT4G39530
PATHWAY2     ath        AT1G33350
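
The three-column layout described above can be read into per-group gene lists with a short sketch like the one below. It assumes a tab-delimited file with a header line starting with "#". The function name `read_groups` is hypothetical, and this is not how SMAP_target-selection.py itself necessarily parses the file.

```python
import csv
import io
from collections import defaultdict


def read_groups(handle, species):
    """Group gene IDs by group_id for one species.

    Assumes tab-delimited rows of (group_id, species, gene_id);
    lines starting with '#' are treated as header or comment lines.
    """
    groups = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3 or row[0].startswith("#"):
            continue
        group_id, row_species, gene_id = row[:3]
        if row_species == species:
            groups[group_id].append(gene_id)
    return dict(groups)


# In-memory stand-in for a gene families data file.
example = io.StringIO(
    "#group_id\tspecies\tgene_id\n"
    "PATHWAY1\tath\tAT5G03800\n"
    "PATHWAY1\tath\tAT5G52850\n"
    "PATHWAY2\tath\tAT3G05240\n"
)
groups = read_groups(example, "ath")
```

A check of this kind can be useful before running the script, for instance to confirm that every group_id listed for -f, --hom_groups actually occurs in the data file for the chosen species.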

It is mandatory to specify either a list of homology groups of interest or a list of genes of interest:
-f, --hom_groups  (str)  Path to the list of homology groups of interest [no default; if omitted, the given list of genes is used].
-g, --genes  (str)  Path to the list of genes of interest [no default; if omitted, the given list of homology groups is used].
Optionally, a flanking region can be extracted upstream and downstream of each target gene:
-r, --region  (int)  Number of nucleotides by which to extend the FASTA sequence of each gene of interest on both sides; if fewer nucleotides are available, the maximum possible extension is used [default: 0; otherwise enter a positive value].

Options may be given in any order.
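
The behaviour of -r, --region ("the given number of nucleotides or the maximum possible") can be pictured as clamping the extended coordinates to the ends of the chromosome or scaffold. The sketch below is an illustration under that assumption, using 1-based inclusive coordinates. The function name `extend_region` is hypothetical and not part of the SMAP script.

```python
def extend_region(start, end, flank, seq_length):
    """Extend a 1-based inclusive interval by `flank` nucleotides on both sides.

    The result is clamped to [1, seq_length], so a gene close to a
    sequence end receives the maximum possible extension instead.
    """
    if flank < 0:
        raise ValueError("flank must be zero or a positive integer")
    return max(1, start - flank), min(seq_length, end + flank)


# Gene far from both sequence ends: the full 500 nt is added on each side.
full = extend_region(1000, 2000, 500, 10000)
# Gene close to both ends of a short scaffold: the extension is truncated.
clamped = extend_region(300, 900, 500, 1200)
```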

Example commands

Command to run the script with specified GFF and FASTA file, gene families data file, species, region and list with genes of interest:

python3 SMAP_target-selection.py /path/to/gff /path/to/fasta /path/to/gene_family_info ath --region 500 --genes /path/to/gene_list

Command to run the script with specified GFF and FASTA file, gene families data file, species, region and list with homology groups of interest:

python3 SMAP_target-selection.py /path/to/gff /path/to/fasta /path/to/gene_family_info ath --region 500 --hom_groups /path/to/hom_list
Once the FASTA and GFF files are obtained, SMAP design is run with these files and optionally with a gRNA file. SMAP design first filters the gRNAs from the list and generates amplicons on the reference sequences. See further description under SMAP design.
If the script is run in the directory that contains the input files, the paths must still be given explicitly as “./<file>”. Otherwise the script likely attempts to write the output to the root directory, which can produce the error: permission denied for “/<output_file>”.
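
When the command is assembled programmatically, the explicit "./" prefix can be added with a small helper such as the sketch below. It relies only on the standard library. The function name `as_explicit_path` and the example file names are hypothetical.

```python
import os


def as_explicit_path(path):
    """Prefix a bare file name with the current-directory marker.

    Paths that already contain a directory component are returned unchanged.
    """
    if os.path.dirname(path) == "":
        return os.path.join(os.curdir, path)
    return path


bare = as_explicit_path("ath.gff3")                   # gains a "./" prefix on POSIX systems
nested = as_explicit_path("/path/to/ath.fasta.gz")    # already explicit, unchanged
```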

Output