Scope & Usage

Scope

SMAP design creates highly multiplex amplicon sequencing (HiPlex) primers and/or gRNA panels for genotyping CRISPR/Cas-induced variation or natural genetic variation in a genepool. The designs can be highly customised.

HiPlex is a cost-effective method for targeted sequencing of multiple genomic loci and identification of genome sequence diversity, including naturally occurring genetic variation in genepools and CRISPR/Cas-induced mutations. Mutation screens can be upscaled by multiplexing (loci) and/or pooling (samples) at various levels of the experimental design. This further reduces the effort and cost of library preparation and sequencing, while increasing the coverage of the genomic targets and maintaining sensitivity for rare alleles, specificity of amplification, and assignment of detected allelic variants to their respective loci.

../_images/Design_overview_scope.png

While screening for natural variation and screening for CRISPR/Cas-induced mutations rely on the same techniques, specific parameters need to be considered for each purpose. Nevertheless, if all parameters are optimized in a single integrated design, a HiPlex primer assay can be developed that allows for combined screening of materials across diverse and complementary sources.

Integration in the SMAP workflow

../_images/SMAP_global_scheme_home_design.png

SMAP design is run on a reference sequence FASTA file with candidate genes and an associated GFF file with gene annotations, both created by SMAP target-selection using precomputed gene families (e.g. obtained from PLAZA). Optionally, a gRNA file obtained with third-party software (e.g. CRISPOR or FlashFry) can be provided. SMAP design creates HiPlex designs and is run before SMAP haplotype-sites or SMAP haplotype-window. BED files with locus positions for SMAP haplotype-sites may be compared with SMAP compare.

Function of SMAP design

SMAP design integrates all design rules and user-defined selection criteria and performs multiplex primer design, optionally combined with gRNA selection, and scales from single genes to thousands of genes.
SMAP design takes as input reference sequences of the target regions (FASTA and GFF) and designs highly specific non-overlapping amplicons to cover the target site or to cover one or more gRNAs (generated by e.g. CRISPOR or FlashFry).
SMAP design ensures that the primers cannot misprime on the other target genes included in the reference sequences provided.
Each gRNA is selected from a list of pre-computed gRNAs based on:
  • its sequence: only gRNAs without a poly-T stretch (≥4T; a Pol III termination signal) and without a BsaI or BbsI restriction site are retained.

  • its specificity: only gRNAs with a minimal user-defined MIT score are retained (default 80%).

  • its position within the target region: only gRNAs targeting a user-defined central segment of the CDS, or specific critical domain, are retained.

  • its position within the amplicon: only gRNAs that are positioned at minimal user-defined distance from both forward and reverse primer are retained.
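
The sequence, specificity, and primer-distance filters above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used by SMAP design: the Guide record, the function names, and the coordinate convention (0-based positions on the amplicon, with the forward primer ending at fwd_primer_end and the reverse primer starting at rev_primer_start) are assumptions. The BsaI (GGTCTC) and BbsI (GAAGAC) recognition sites are checked on both strands. The filter on position within the target region is left out because it depends on the gene model.

```python
from dataclasses import dataclass

# Recognition sites of BsaI and BbsI, each with its reverse complement.
BSAI_SITES = ("GGTCTC", "GAGACC")
BBSI_SITES = ("GAAGAC", "GTCTTC")


@dataclass
class Guide:
    """Hypothetical gRNA record; SMAP design's internal format may differ."""
    sequence: str     # protospacer sequence, 5'->3'
    start: int        # 0-based start position on the amplicon
    mit_score: float  # specificity score on a 0-100 scale


def passes_sequence_filter(sequence: str) -> bool:
    """Reject poly-T stretches (>=4 T) and BsaI/BbsI restriction sites."""
    sequence = sequence.upper()
    if "TTTT" in sequence:
        return False
    return not any(site in sequence for site in BSAI_SITES + BBSI_SITES)


def passes_distance_filter(guide: Guide, fwd_primer_end: int,
                           rev_primer_start: int, min_distance: int = 15) -> bool:
    """Require a minimum gap between the gRNA and both primers."""
    guide_end = guide.start + len(guide.sequence)
    return (guide.start - fwd_primer_end >= min_distance
            and rev_primer_start - guide_end >= min_distance)


def select_guides(guides, fwd_primer_end, rev_primer_start,
                  min_mit=80.0, min_distance=15):
    """Return the guides that pass all three filters, in input order."""
    return [
        g for g in guides
        if passes_sequence_filter(g.sequence)
        and g.mit_score >= min_mit
        and passes_distance_filter(g, fwd_primer_end, rev_primer_start, min_distance)
    ]
```

The defaults mirror the values documented on this page (a minimal MIT score of 80 and a minimum primer-gRNA distance of 15 bases); ranking of the retained gRNAs is not shown.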

In the end, SMAP design will have generated for each gene of interest either a set of non-overlapping amplicons to cover the genes as much as possible (to screen for natural variation), or, if a gRNA file is given, a user-defined maximum number of amplicons per gene each covering a user-defined maximum number of gRNAs (for CRISPR/Cas experiments).

Input of SMAP design

Focus on candidate genes

SMAP design requires at least a FASTA file with the genes of interest and a corresponding GFF file, which contains at least the CDS features.
  • As SMAP design is conceived in the context of identification and/or creation of sequence variation in candidate genes, it is highly recommended to work with sets of candidate genes, whereby each gene is represented by a separate reference sequence in the FASTA file (the genomic sequence, not the CDS/transcript sequence, which lacks the intron sequences), and a GFF describing the gene model according to the coordinates of those reference sequences.

  • It is not recommended to work with chromosome-scale sequences (whole genome assemblies). This is because the naming conventions used by SMAP number amplicons and gRNAs sequentially according to the sequence ID in the FASTA file, and some downstream applications (such as SMAP effect-prediction) require gene models to be defined on the positive strand and can only interpret data in “separate gene - separate reference sequence” format.

  • Therefore, SMAP target-selection facilitates easy extraction of sets of target sequences for SMAP design, such as candidate genes.

  • SMAP target-selection uses a list of candidate geneIDs (or gene family IDs) and a genome GFF file to extract the corresponding sequences from a reference genome sequence FASTA file, orients all sequences with the CDS on the forward strand, and provides a new GFF with respective gene feature coordinates. Ideally, such precomputed lists of candidate genes are obtained from comparative genomics databases such as PLAZA.

  • Including all gene family members of candidate genes into the reference sequence (FASTA) for primer and gRNA design ensures that alternative genomic sequences with the highest sequence similarity (i.e. the most likely off-target binding sequences for primers and gRNAs) have been considered for specificity checks during the design phase.
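
Before starting a design, it can help to check that the input follows the “separate gene - separate reference sequence” layout described above. The sketch below is a minimal, hypothetical pre-flight check and is not part of SMAP: it assumes that the sequence ID is the first whitespace-delimited token of each FASTA header and that the GFF is tab-delimited with nine columns. It reports reference sequences that have no CDS feature and sequences that carry a CDS on the minus strand.

```python
def fasta_ids(fasta_text):
    """Return the sequence IDs (first token of each header) in file order."""
    return [
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].strip()
    ]


def cds_strands(gff_text):
    """Map each GFF seqid to the set of strands on which it has CDS features."""
    strands = {}
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        fields = line.split("\t")
        if len(fields) >= 9 and fields[2] == "CDS":
            strands.setdefault(fields[0], set()).add(fields[6])
    return strands


def check_inputs(fasta_text, gff_text):
    """Return (sequences without CDS, sequences with a minus-strand CDS)."""
    strands = cds_strands(gff_text)
    ids = fasta_ids(fasta_text)
    without_cds = [seq_id for seq_id in ids if seq_id not in strands]
    minus_strand = [seq_id for seq_id in ids if "-" in strands.get(seq_id, set())]
    return without_cds, minus_strand
```

Two empty lists suggest that every reference sequence has a forward-oriented CDS annotation; anything else is a hint to regenerate the files with SMAP target-selection.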

The output of SMAP target-selection can immediately be used as input for SMAP design.
Optionally, a set of gRNAs for each target gene can be given. The output of the gRNA identification tools CRISPOR and FlashFry can immediately be used as input for SMAP design. Other gRNA design programs can be used, but their output will likely have to be adapted to be compatible with SMAP design (see next section).

Guidelines for gRNA design with CRISPOR, FlashFry, or other

gRNA design can be performed with third-party software such as CRISPOR or FlashFry:
  • gRNA sequences are provided to SMAP design as a TSV file with header (the first line of the gRNA file is skipped so a header is necessary but arbitrary).

  • If the gRNAs are designed by CRISPOR or FlashFry the column order should be as shown in the respective examples (both formats contain 12 columns).

  • By default SMAP design will assume the gRNAs are in the FlashFry format. Otherwise, the user should set --gRNAsource CRISPOR or --gRNAsource other.

  • FlashFry should be run with the following scoring metric parameter to obtain the desired output for SMAP design: --scoringMetrics doench2014ontarget,doench2016cfd,hsu2013.

  • SMAP design uses the specificity score (and to a lesser degree the efficiency score) to rank the gRNAs. Other scoring metrics can be used if desired (e.g. replacing the MIT score by the CFD score).

  • Note that the Doench score in the FlashFry output ranges from 0 to 1 (not 1 to 100 as for CRISPOR).
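
When adapting the output of another gRNA design program, a quick format check can catch problems early. The following is a minimal sketch under stated assumptions, not SMAP design's own parser: it skips the first line as a header, expects 12 tab-separated columns (as in the CRISPOR and FlashFry formats), and offers a hypothetical helper to_percent to bring a 0 to 1 score onto a 0 to 100 scale. Which column holds which score is deliberately not encoded here; consult the format examples for that.

```python
import csv
import io

EXPECTED_COLUMNS = 12  # both the CRISPOR and FlashFry formats have 12 columns


def read_grna_rows(handle, expected_columns=EXPECTED_COLUMNS):
    """Read a gRNA TSV, skipping the header line and checking the column count."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # the first line is always treated as a header
    rows = []
    for line_number, row in enumerate(reader, start=2):
        if not row:
            continue  # ignore blank lines
        if len(row) != expected_columns:
            raise ValueError(
                f"line {line_number}: expected {expected_columns} columns, "
                f"found {len(row)}"
            )
        rows.append(row)
    return rows


def to_percent(score):
    """Convert a score on a 0-1 scale (as text or number) to a 0-100 scale."""
    return float(score) * 100.0


if __name__ == "__main__":
    header = "\t".join(f"col{i}" for i in range(1, 13))
    row = "\t".join(["x"] * 12)
    print(len(read_grna_rows(io.StringIO(f"{header}\n{row}\n"))))
```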

Basic commands to run FlashFry

#Install FlashFry
wget https://github.com/mckennalab/FlashFry/releases/download/1.15/FlashFry-assembly-1.15.jar

#Create off-target database

mkdir tmp
java -Xmx4g -jar FlashFry-assembly-1.15.jar index --tmpLocation ./tmp --database Arabidopsis_HOM0001 --reference Arabidopsis_HOM0001.fasta --enzyme spcas9ngg

#Discover gRNAs in reference sequences

java -Xmx4g -jar FlashFry-assembly-1.15.jar discover --database Arabidopsis_HOM0001 --fasta Arabidopsis_HOM0001.fasta --output Arabidopsis_HOM0001_guides.fasta.off_targets

#Create scores per gRNA

java -Xmx4g -jar FlashFry-assembly-1.15.jar score --input Arabidopsis_HOM0001_guides.fasta.off_targets --output Arabidopsis_HOM0001_guides.fasta.off_targets.scores --scoringMetrics doench2014ontarget,doench2016cfd,hsu2013  --database Arabidopsis_HOM0001

Commands & options

SMAP design has two mandatory positional arguments and multiple optional arguments.

Mandatory options for SMAP design:

If SMAP design is run to generate amplicons for natural variation screening, only a FASTA file and a GFF file are needed. Both files can be obtained using SMAP target-selection.
If SMAP design is run to generate gRNAs flanked by primers, a gRNA file should be provided as well.
  • FASTA file (str): Path to the FASTA file containing all genes to screen. Genes are ideally all oriented with their coding sequence in forward orientation [no default].

  • GFF file (str): Path to the GFF3 file with at least the CDS features, with positions relative to the FASTA file [no default].

A gRNA file can be provided with the -g or --gRNAfile option:

-g or --gRNAfile (str): Path to the gRNA file.

Basic command to run SMAP design with default parameters:

python3 SMAPdesign.py genes.fasta genes.gff
or
python3 SMAPdesign.py genes.fasta genes.gff -g gRNAs.tsv

Options may be given in any order.

-o, --output (str): Basename for the output files [SMAPdesign].
-sg, --selectGenes: Path to a text file containing one gene name per line. These gene names refer to the names used in the FASTA file. If this option is used, designs are made only for the genes listed in the text file. The other genes in the FASTA file, not listed in the text file, will still be used to check for mispriming by Primer3.
-d, --distance (int): Minimum number of bases between the gRNA and primer [15].
-b, --borderLength (int): The length of the borders [10]. The borders are used for downstream analysis by SMAP haplotype-window.
-v, --verbose: Verbose; list which target is being processed as the program progresses.
--version: Show the version. Disregards all other parameters.
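
The gene selection file for -sg is a plain list of names, one per line, that must match the names in the FASTA file. A minimal sketch for producing it is shown below; the function names are hypothetical, and it assumes that the gene name is the first whitespace-delimited token of each FASTA header. Names that do not occur in the FASTA file raise an error, so typos are caught before a design is started.

```python
def fasta_names(fasta_text):
    """Return the names (first token of each header line) in a FASTA text."""
    return [
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].strip()
    ]


def gene_selection_text(fasta_text, wanted):
    """Build the content of a gene selection file: one gene name per line."""
    available = set(fasta_names(fasta_text))
    missing = [name for name in wanted if name not in available]
    if missing:
        raise ValueError("not found in FASTA: " + ", ".join(missing))
    return "".join(f"{name}\n" for name in wanted)
```

The returned text can be written to a file such as geneSelection.txt and passed to SMAP design with -sg geneSelection.txt.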

Output

By default, SMAP design provides:

tabular files:
  • a primer file (TSV file with the gene ID, primer ID and primer sequence).

  • a gRNA file (TSV file with the gene ID, gRNA ID and gRNA sequence).

  • a GFF file containing the selected primer and gRNA features (and all other features present in the genome annotation GFF file).
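
For downstream ordering or bookkeeping, the primer file can be grouped per gene. The sketch below is an assumption-laden example rather than a description of the exact file layout: it assumes three tab-separated columns in the order gene ID, primer ID, primer sequence, and lets the caller state whether the first line is a header (has_header is a hypothetical parameter; check the actual output file before relying on it).

```python
import csv
import io
from collections import defaultdict


def primers_by_gene(handle, has_header=False):
    """Group (primer ID, sequence) pairs by gene ID from a three-column TSV."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header line when the file has one
    grouped = defaultdict(list)
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or incomplete lines
        gene_id, primer_id, sequence = row[:3]
        grouped[gene_id].append((primer_id, sequence.upper()))
    return dict(grouped)


if __name__ == "__main__":
    example = "geneA\tgeneA_p1_F\tacgtacgt\ngeneA\tgeneA_p1_R\ttgcatgca\n"
    for gene, primers in primers_by_gene(io.StringIO(example)).items():
        print(gene, len(primers))
```

The gene and primer IDs in the example are made up for illustration and do not reflect the naming scheme used by SMAP design.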

Example usage

Here are a few example instructions to install and use SMAP design (and the tools to generate input files for SMAP design) using the example files in the GitLab repo.

Please check out the tutorial for more details.

To obtain the FASTA and GFF file using SMAP target-selection (output files created by the code below can be found in the samples folder):

# install SMAP target-selection
git clone git@gitlab.com:truttink/smap.git
cd smap/utilities

# you can find the files that are used here in the samples directory of the SMAP design repo (if you use the exact command as shown below, copy those files into the utilities folder)
# run SMAP target-selection to obtain the FASTA and GFF files of a few gene families
python3 SMAP_target-selection.py Ath.gff3 ath.con dicot_genefamily_data.hom.csv ath -f Arabidopsis_homology_groups.txt

To obtain the gRNA file e.g. for the WNK gene family (HOM04D000265) using FlashFry (output file created by the code below can be found in the samples folder):

# install FlashFry
wget https://github.com/mckennalab/FlashFry/releases/download/1.15/FlashFry-assembly-1.15.jar

#Create off-target database
mkdir tmp
java -Xmx4g -jar FlashFry-assembly-1.15.jar index --tmpLocation ./tmp --database ath_database --reference ath.con --enzyme spcas9ngg

#Discover gRNAs in reference sequences
java -Xmx4g -jar FlashFry-assembly-1.15.jar discover --database ath_database --fasta Arabidopsis_WNK_family.fasta --output Arabidopsis_WNK_gRNAs.output

#Create scores per gRNA
java -Xmx4g -jar FlashFry-assembly-1.15.jar score --input Arabidopsis_WNK_gRNAs.output --output Arabidopsis_WNK_family_gRNAs_FlashFry.tsv --scoringMetrics doench2014ontarget,doench2016cfd,hsu2013  --database ath_database

Using the output from SMAP target-selection and FlashFry, SMAP design can be run as follows:

# install SMAP design
git clone git@gitlab.com:ilvo/smap-design.git
pip install primer3-py biopython pandas numpy matplotlib gffutils
cd smap-design/smap_design

# run SMAP design to screen for natural variation (*i.e.* without gRNAs)
# request a maximum of 100 non-overlapping amplicons per WNK gene, verbose, create a summary file and plot, create a SMAP BED file, and a border GFF file.
python3 SMAPdesign.py ../samples/Arabidopsis_WNK_family.fasta ../samples/Arabidopsis_WNK_family.gff -na 100 -v -smy -sf

# run SMAP design to screen for CRISPR/Cas-induced gene edits (*i.e.* with gRNAs)
# request a maximum of 3 non-overlapping amplicons per WNK gene, verbose, create a summary file and plot, create a SMAP BED file, a border GFF file and a gRNA fasta file.
python3 SMAPdesign.py ../samples/Arabidopsis_WNK_family.fasta ../samples/Arabidopsis_WNK_family.gff -g ../samples/Arabidopsis_WNK_family_gRNAs_FlashFry.tsv -na 3 -v -smy -sf

Command to run SMAP design with specified FASTA and GFF file, a gRNA file, output name “MAP3K_SMAPdesign_output”, a text file with a selection of genes to perform design on, and a minimum distance between primer and gRNA of 20 bases:

python3 SMAPdesign.py genes.fasta genes.gff -g gRNAs.tsv -o MAP3K_SMAPdesign_output -sg geneSelection.txt -d 20

Command to run SMAP design with a gRNA file from CRISPOR, output name “MAP3K_SMAPdesign_output”, verbose, maximum 1 gRNA per amplicon, an MIT threshold of 90, targeting the complete gene:

python3 SMAPdesign.py genes.fasta genes.gff -g gRNAs.tsv -gs CRISPOR --output MAP3K_SMAPdesign_output -v -ng 1 -t 90 -tr5 0 -tr3 0

Command to run SMAP design with a gRNA file from neither CRISPOR nor FlashFry (but e.g. from CHOPCHOP), 3 gRNAs per amplicon, 2 amplicons per gene, amplicons of length 400 - 800 bp, a primer-gRNA distance of 150 bp, not checking for mispriming between target genes, targeting only the first half of the genes, labeling amplicons and gRNAs from left to right and a minimum distance of 10 bases between adjacent gRNAs:

python3 SMAPdesign.py genes.fasta genes.gff -g gRNAs.tsv -gs other -ng 3 -na 2 -minl 400 -maxl 800 -d 150 -mpa -tr5 0 -tr3 0.5 -gl -al -go 10

Command to run SMAP design with a gRNA file from FlashFry, only targeting the kinase domains, with an adapted promoter, labeling the gRNAs from left to right, giving a summary, SMAP, borders and gRNA file, allAmplicons file and debug file:

python3 SMAPdesign.py genes.fasta genes.gff -g gRNAs.tsv -tsr kinase -prom GTGGCA -gl -smy -sf -aa -db