.. raw:: html

    <style> .purple {color:purple} </style>

.. role:: purple

.. raw:: html

    <style> .white {color:white} </style>

.. role:: white


#############
Scope & Usage
#############

Scope
-----

**SMAP design** creates highly multiplex amplicon sequencing (HiPlex) primers and/or gRNA panels for genotyping CRISPR/Cas-induced variation or natural genetic variation in a genepool. The designs can be highly customised.

HiPlex is a cost-effective method for targeted sequencing of multiple genomic loci and identification of genome sequence diversity, including naturally occurring genetic variation in genepools and CRISPR/Cas-induced mutations.
Mutation screens can be scaled up by multiplexing loci and/or pooling samples at various levels of the experimental design. This reduces the effort and cost of library preparation and sequencing, while increasing the coverage of the genomic targets and maintaining sensitivity for rare alleles, specificity of amplification, and correct assignment of detected allelic variants to their respective loci.

.. image:: ../images/design/Design_overview_scope.png

Although screening for natural variation and screening for CRISPR/Cas-induced mutations rely on the same techniques, each purpose requires its own parameter settings. Nevertheless, if all parameters are optimized in a single integrated design, a HiPlex primer assay can be developed that allows for combined screening of materials across diverse and complementary sources.

Integration in the SMAP workflow
--------------------------------

.. image:: ../images/design/SMAP_global_scheme_home_design.png

**SMAP design** is run on a reference sequence FASTA file with candidate genes and the associated GFF file with gene annotations, both created by :ref:`SMAP target-selection <SMAP_target_selection_index>` using precomputed gene families (*e.g.* obtained from `PLAZA <https://bioinformatics.psb.ugent.be/plaza/>`_). Optionally, a gRNA file obtained with third-party software (*e.g.* `CRISPOR <http://crispor.tefor.net/>`_ or `FlashFry <https://github.com/mckennalab/FlashFry>`_) can be added.
**SMAP design** creates HiPlex designs and is run before :ref:`SMAP haplotype-sites <SMAPhaploindex>` or :ref:`SMAP haplotype-window <SMAPwindowindex>`. BED files with locus positions for SMAP haplotype-sites may be compared with :ref:`SMAP compare <SMAPcompindex>`.

Function of SMAP design
-----------------------
| **SMAP design** integrates all :ref:`design rules <smapdesignrules>` and user-defined :ref:`selection criteria <SMAPdesignSummaryCommand>` and performs multiplex primer design, optionally combined with gRNA selection, and scales from single genes to thousands of genes.
| **SMAP design** takes as input reference sequences of the target regions (FASTA and GFF) and designs highly specific non-overlapping amplicons to cover the target site or to cover one or more gRNAs (generated by *e.g.* `CRISPOR <http://crispor.tefor.net/>`_ or `FlashFry <https://github.com/mckennalab/FlashFry>`_).
| **SMAP design** ensures that the primers cannot misprime on the other target genes included in the provided reference sequences.

| Each gRNA is selected from a list of pre-computed gRNAs based on:

* its sequence: only gRNAs without a poly-T stretch (≥4T; a Pol III termination signal) and without a BsaI or BbsI restriction site are retained.
* its specificity: only gRNAs with at least a *user-defined* minimum MIT score are retained (default 80).
* its position within the target region: only gRNAs targeting a *user-defined* central segment of the CDS, or a specific critical domain, are retained.
* its position within the amplicon: only gRNAs that are positioned at least a *user-defined* minimum distance from both the forward and the reverse primer are retained.
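
The four criteria can be illustrated with a minimal Python sketch. This is not the code used by **SMAP design**: the function names are hypothetical, coordinates are assumed to be inclusive, and searching the restriction sites in the promoter + gRNA + scaffold construct is an assumption based on the description of the ``--promoter`` and ``--scaffold`` options. The default values are those listed under Commands & options.

.. code-block:: python

    # Recognition sites of the two Golden Gate enzymes, on both strands.
    RESTRICTION_SITES = ("GGTCTC", "GAGACC",  # BsaI
                         "GAAGAC", "GTCTTC")  # BbsI


    def passes_sequence(grna, promoter="TGATTG", scaffold="GTTTTA", poly_t=4):
        """No poly-T stretch in the gRNA, no BsaI/BbsI site in the construct."""
        grna = grna.upper()
        # Assumption: sites spanning the promoter or scaffold junction also count.
        construct = promoter.upper() + grna + scaffold.upper()
        return "T" * poly_t not in grna and not any(
            site in construct for site in RESTRICTION_SITES)


    def passes_specificity(mit_score, threshold=80):
        """MIT score at or above the user-defined threshold."""
        return mit_score >= threshold


    def passes_cds_position(cds_offset, cds_length, tr5=0.2, tr3=0.2):
        """gRNA lies in the central CDS segment (offset counted along the CDS)."""
        return tr5 * cds_length <= cds_offset <= (1 - tr3) * cds_length


    def passes_primer_distance(grna_start, grna_end, fwd_primer_end,
                               rev_primer_start, distance=15):
        """At least `distance` bases separate the gRNA from each primer."""
        return (grna_start - fwd_primer_end - 1 >= distance
                and rev_primer_start - grna_end - 1 >= distance)

A gRNA would be retained only if all four functions return ``True``. Note that a gRNA starting with ``AAGAC`` fails the sequence check with the default promoter end ``TGATTG``, because the junction creates a BbsI site (``GAAGAC``).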

| In the end, **SMAP design** will have generated for each gene of interest either a set of non-overlapping amplicons to cover the genes as much as possible (to screen for natural variation), or, if a gRNA file is given, a *user-defined* maximum number of amplicons per gene each covering a *user-defined* maximum number of gRNAs (for CRISPR/Cas experiments).
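
The idea of retaining a limited number of non-overlapping amplicons per gene can be sketched with a simple greedy selection. The selection procedure actually used by **SMAP design** is not described here, so the function below is only an assumption-laden illustration: the function name, the ``(start, end, score)`` tuples and the "higher score is better" ranking are all hypothetical.

.. code-block:: python

    def select_amplicons(amplicons, max_amplicons=2):
        """Greedily pick up to max_amplicons mutually non-overlapping amplicons.

        amplicons: list of (start, end, score) tuples with inclusive
        coordinates; a higher score marks a better amplicon (hypothetical).
        """
        selected = []
        # Visit the best-scoring candidates first.
        for start, end, score in sorted(amplicons, key=lambda a: -a[2]):
            # Keep a candidate only if it overlaps none of the chosen ones.
            if all(end < s or start > e for s, e, _ in selected):
                selected.append((start, end, score))
            if len(selected) == max_amplicons:
                break
        return sorted(selected)


    candidates = [(100, 240, 3.0), (200, 340, 5.0), (400, 530, 4.0), (600, 740, 1.0)]
    best_two = select_amplicons(candidates, max_amplicons=2)

In this example the amplicon at 100-240 is never selected, because it overlaps the better-scoring amplicon at 200-340.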

Input of SMAP design
--------------------

:purple:`Focus on candidate genes`

| **SMAP design** requires at least a FASTA file with the genes of interest and a corresponding GFF file, which contains at least the CDS features.

* **SMAP design** is conceived for identifying and/or creating sequence variation in candidate genes. It is therefore highly recommended to work with sets of candidate genes in which each gene is represented by a separate reference sequence in the FASTA file (the genomic sequence, **not** the CDS/transcript sequence, which lacks the introns), together with a GFF file describing the gene model in the coordinates of those reference sequences.
* It is **not** recommended to work with chromosome-scale sequences (whole genome assemblies). The naming conventions used by SMAP number amplicons and gRNAs sequentially per sequence ID in the FASTA file, and some downstream applications (such as :ref:`SMAP effect-prediction <SMAPeffectindex>`) require gene models to be defined on the positive strand and can only interpret data in "separate gene - separate reference sequence" format.
* Therefore, :ref:`SMAP target-selection <SMAP_target-selection_usage>` facilitates easy extraction of sets of target sequences for **SMAP design**, such as candidate genes.
* **SMAP target-selection** uses a list of candidate geneIDs (or gene family IDs) and a genome GFF file to extract the corresponding sequences from a reference genome sequence FASTA file, orients all sequences with the CDS on the forward strand, and provides a new GFF with respective gene feature coordinates. Ideally, such precomputed lists of candidate genes are obtained from comparative genomics databases such as `PLAZA <https://bioinformatics.psb.ugent.be/plaza/>`_.
* Including all gene family members of candidate genes into the reference sequence (FASTA) for primer and gRNA design ensures that alternative genomic sequences with the highest sequence similarity (*i.e.* the most likely off-target binding sequences for primers and gRNAs) have been considered for specificity checks during the design phase.

| The output of **SMAP target-selection** can immediately be used as input for **SMAP design**.
| Optionally, a set of gRNAs for each target gene can be given. The output of the gRNA identification tools `CRISPOR <http://crispor.tefor.net/>`_ and `FlashFry <https://github.com/mckennalab/FlashFry>`_ can immediately be used as input for **SMAP design**. Other gRNA design programs can be used, but their output will likely have to be adapted to be compatible with **SMAP design** (see :ref:`next section <SMAPDesigngRNA>`).
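
Before starting a design, it can be useful to confirm that every reference sequence in the FASTA file has at least one CDS feature in the GFF file. The sketch below is a hypothetical helper, not part of SMAP; it only assumes the standard tab-separated GFF3 layout (sequence ID in column 1, feature type in column 3) and takes the file contents as strings.

.. code-block:: python

    def fasta_ids(fasta_text):
        """Return the sequence IDs (first word of each '>' header line)."""
        return [line[1:].split()[0] for line in fasta_text.splitlines()
                if line.startswith(">") and line[1:].strip()]


    def genes_without_cds(fasta_text, gff_text):
        """Return the FASTA sequence IDs that have no CDS feature in the GFF."""
        with_cds = set()
        for line in gff_text.splitlines():
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and GFF comments/pragmas
            columns = line.split("\t")
            if len(columns) >= 3 and columns[2] == "CDS":
                with_cds.add(columns[0])
        return [seq_id for seq_id in fasta_ids(fasta_text)
                if seq_id not in with_cds]


    # Made-up example data: geneB has a gene feature but no CDS feature.
    example_fasta = ">geneA some description\nACGTACGT\n>geneB\nTTGCA\n"
    example_gff = ("##gff-version 3\n"
                   "geneA\tsrc\tgene\t1\t8\t.\t+\t.\tID=geneA\n"
                   "geneA\tsrc\tCDS\t2\t7\t.\t+\t0\tParent=geneA\n"
                   "geneB\tsrc\tgene\t1\t5\t.\t+\t.\tID=geneB\n")
    missing = genes_without_cds(example_fasta, example_gff)

An empty result means that each sequence in the FASTA file is covered by at least one CDS feature.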

.. _SMAPDesigngRNA:

Guidelines for gRNA design with CRISPOR, FlashFry, or other
-----------------------------------------------------------
| gRNA design can be performed with third-party software such as `CRISPOR <http://crispor.tefor.net/>`_ or `FlashFry <https://github.com/mckennalab/FlashFry>`_.

* gRNA sequences are provided to **SMAP design** as a TSV file with a header (the first line of the gRNA file is always skipped, so a header line is required, but its content is arbitrary).
* If the gRNAs are designed by **CRISPOR** or **FlashFry** the column order should be as shown in the respective examples (both formats contain 12 columns).
* By default **SMAP design** will assume the gRNAs are in the FlashFry format. Otherwise, the user should set ``--gRNAsource CRISPOR`` or ``--gRNAsource other``.
* FlashFry should be run with the following scoring metric parameter to obtain the desired output for **SMAP design**: ``--scoringMetrics doench2014ontarget,doench2016cfd,hsu2013``.
* **SMAP design** uses the specificity score (and to a lesser degree the efficiency score) to rank the gRNAs. Other scoring metrics can be used if desired (*e.g.* replacing the MIT score by the CFD score).
* Note that the Doench score in the **FlashFry** output ranges from 0 to 1 (not 1 to 100 as for **CRISPOR**).
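
For gRNAs from other sources, the six-column layout expected with ``--gRNAsource other`` (GeneID, gRNAseq, MITscore, nr of off-Targets, Doench, OOF; see the *Other* tab) can be written with a few lines of Python. The sketch below is illustrative only: the function name, the record keys and the header names are assumptions, and the optional rescaling simply multiplies Doench scores given on a 0 to 1 scale by 100.

.. code-block:: python

    import csv
    import io

    # The header line is skipped by SMAP design, so these names are arbitrary.
    HEADER = ["GeneID", "gRNAseq", "MITscore", "offTargets", "Doench", "OOF"]


    def to_other_format(records, doench_is_fraction=False):
        """Serialise gRNA records (dicts) to a six-column TSV string.

        Missing scores are written as 'NA'. If doench_is_fraction is True,
        Doench scores on a 0-1 scale are rescaled to 0-100.
        """
        out = io.StringIO()
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        writer.writerow(HEADER)
        for rec in records:
            doench = rec.get("doench")
            if doench is not None and doench_is_fraction:
                doench = round(doench * 100, 2)
            row = [rec["gene"], rec["seq"], rec.get("mit"),
                   rec.get("off_targets"), doench, rec.get("oof")]
            writer.writerow(["NA" if value is None else value for value in row])
        return out.getvalue()


    # Made-up record: only the MIT and Doench scores are known.
    tsv_text = to_other_format(
        [{"gene": "geneA", "seq": "GACGTACGTAGCTAGCATGC", "mit": 92, "doench": 0.5}],
        doench_is_fraction=True)

The resulting text can be saved to a file and passed to **SMAP design** with ``-g`` together with ``--gRNAsource other``.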

.. tabs::

   .. tab:: FlashFry command line

           Basic commands to run **FlashFry** ::

                # Install FlashFry
                wget https://github.com/mckennalab/FlashFry/releases/download/1.15/FlashFry-assembly-1.15.jar

                # Create off-target database
                mkdir tmp
                java -Xmx4g -jar FlashFry-assembly-1.15.jar index --tmpLocation ./tmp --database Arabidopsis_HOM0001 --reference Arabidopsis_HOM0001.fasta --enzyme spcas9ngg

                # Discover gRNAs in reference sequences
                java -Xmx4g -jar FlashFry-assembly-1.15.jar discover --database Arabidopsis_HOM0001 --fasta Arabidopsis_HOM0001.fasta --output Arabidopsis_HOM0001_guides.fasta.off_targets

                # Create scores per gRNA
                java -Xmx4g -jar FlashFry-assembly-1.15.jar score --input Arabidopsis_HOM0001_guides.fasta.off_targets --output Arabidopsis_HOM0001_guides.fasta.off_targets.scores --scoringMetrics doench2014ontarget,doench2016cfd,hsu2013 --database Arabidopsis_HOM0001

   .. tab:: FlashFry
    
          | FlashFry gRNA file

          .. csv-table:: Example of the first rows of the gRNA file generated by **FlashFry**
             :delim: tab
             :file: ../tables/design/WNK_FlashFry_gRNA.tsv
             :header-rows: 1

   .. tab:: CRISPOR command line

           Basic commands to run **CRISPOR**. Please note that this is a **python2** script. ::

                # install CRISPOR
                git clone https://github.com/maximilianh/crisporWebsite.git
                cd crisporWebsite

                # install BWA and a few required python modules
                pip install bwa matplotlib biopython numpy scikit-learn==0.16.1 pandas twobitreader

                # create a directory called genomes and, within it, a directory called e.g. Ath (to download the genome for Arabidopsis thaliana)
                # download your genome of interest from http://crispor.tefor.net/genomes/
                mkdir -p genomes/Ath
                cd genomes/Ath
                wget -r -l1 --no-parent -nd  --reject 'index*' --reject 'robots*' http://crispor.tefor.net/genomes/ensAraTha/

                # if you want a genome that is not in the database (or not the right version) you need to use the crisprAddGenome script in the 'tools' directory
                # This procedure is however a bit more complex and is explained on the github page of CRISPOR (https://github.com/maximilianh/crisporWebsite) under "Adding a genome"

                # run crispor.py as follows: python2 crispor.py organism inFile.fasta outFile.tsv
                # organism is Ath in our example, which refers to the genome in the directory 'genomes'.
                # inFile.fasta is a FASTA file with your genes of interest.
                # outFile.tsv is the name of the output file.
                # (this example assumes that the directories In and Out exist in crisporWebsite)
                cd ../../Out
                python2 ../crispor.py Ath ../In/HOM04D000265_ath.fasta HOM04D000265_ath_output.tsv

   .. tab:: CRISPOR

          | CRISPOR gRNA file

          .. csv-table:: Example of the first rows of the gRNA file generated by **CRISPOR**
             :file: ../tables/design/WNK_CRISPOR_gRNA.csv
             :header-rows: 1

   .. tab:: Other

            | `CRISPOR <http://crispor.tefor.net/>`_ and `FlashFry <https://github.com/mckennalab/FlashFry>`_ both have the ability to calculate certain specificity and efficiency scores as seen in the examples below: MITscore, nr of off-Targets, Doench, Out-of-Frame score (OOF), etc.
            | It is not strictly necessary for **SMAP design** that the gRNAs are generated by either CRISPOR or FlashFry.

                * If the gRNAs are generated by another program (*e.g.* Geneious), this should be indicated with the ``--gRNAsource other`` parameter.
                * gRNA sequences are provided as a TSV file with a header (the first line of the gRNA file is always skipped, so a header line is required, but its content is arbitrary).
                * Columns should be provided in this order: GeneID, gRNAseq, MITscore, nr of off-Targets, Doench, Out-of-Frame score (OOF)
                * If no specificity or efficiency scores are available, this should be indicated as 'NA'.
                * The MIT score, Doench score and OOF score should be between 1 and 100.
                * **SMAP design** uses the specificity score (and to a lesser degree the efficiency score) to rank the gRNAs. Other scoring metrics can be used if desired (*e.g.* replacing the MIT score by the CFD score).

            | The tabs below show some examples of gRNA input files.

            .. tabs::

               .. tab:: No scores

                      | No scores

                      .. csv-table:: Example of a gRNA file without specificity or efficiency scores
                         :file: ../tables/design/WNK_NoScores_gRNA.csv
                         :header-rows: 1

               .. tab:: MIT and OOF score

                      | MIT and OOF score

                      .. csv-table:: Example of a gRNA file with the MIT and OOF scores
                         :file: ../tables/design/WNK_MITOOF_gRNA.csv
                         :header-rows: 1

               .. tab:: MIT, offTargets, Doench, OOF

                      | MIT, offTargets, Doench, OOF

                      .. csv-table:: Example of a gRNA file with the MIT, offTargets, Doench, and OOF scores
                         :file: ../tables/design/WNK_MITdoenchOOFoffTargets_gRNA.csv
                         :header-rows: 1

----

.. _SMAPdesignfilter:

.. _SMAPdesignSummaryCommand:

Commands & options
------------------

**SMAP design** has two mandatory positional arguments and multiple optional arguments.

:purple:`Mandatory options for SMAP design`:

| If **SMAP design** is run to generate amplicons for natural variation screening, only a FASTA file and a GFF file are needed. Both files can be obtained using **SMAP target-selection**.
| If **SMAP design** is run to generate gRNAs flanked by primers, a gRNA file should be provided as well.


-  ``FASTA file`` :white:`#####` *(str)* :white:`###` Path to the FASTA file containing all genes to screen. Genes are ideally all oriented with their coding sequence in forward orientation [no default].
-  ``GFF file`` :white:`######` *(str)* :white:`###` Path to the GFF3 file with at least the CDS features with positions relative to the FASTA file [no default].

A gRNA file can be provided with the *-g* or *\-\-gRNAfile* option:

 ``-g or --gRNAfile`` :white:`#####` *(str)* :white:`##` Path to the gRNA file.

Basic command to run **SMAP design** with default parameters::

    # natural variation screening (amplicons only)
    python3 SMAPdesign.py genes.fasta genes.gff

    # CRISPR/Cas screening (amplicons covering gRNAs)
    python3 SMAPdesign.py genes.fasta genes.gff -g gRNAs.tsv
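
When runs are scripted, the same command line can be assembled in Python. The helper below is a hypothetical convenience wrapper, not part of SMAP; it only uses options documented on this page (``--gRNAfile``, ``--gRNAsource``, ``--output``, ``--summary``, ``--SMAPfiles``) and assumes that ``SMAPdesign.py`` is located in the working directory.

.. code-block:: python

    import shlex


    def build_command(fasta, gff, grna_file=None, grna_source=None,
                      output=None, summary=False, smap_files=False):
        """Assemble the argument list for a SMAP design run."""
        cmd = ["python3", "SMAPdesign.py", fasta, gff]
        if grna_file:
            cmd += ["--gRNAfile", grna_file]
        if grna_source:
            cmd += ["--gRNAsource", grna_source]
        if output:
            cmd += ["--output", output]
        if summary:
            cmd.append("--summary")
        if smap_files:
            cmd.append("--SMAPfiles")
        return cmd


    cmd = build_command("genes.fasta", "genes.gff", grna_file="gRNAs.tsv",
                        grna_source="CRISPOR", summary=True)
    # Show the command as it would be typed in a shell.
    print(shlex.join(cmd))

The returned list can be passed directly to ``subprocess.run(cmd, check=True)`` to start the run.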


See tabs below for specific options. Options may be given in any order.

.. tabs::

    .. tab:: General options

          | ``-o``, ``--output`` :white:`######.` *(str)* :white:`###` Basename for the outputfiles [SMAPdesign].
          | ``-sg``, ``--selectGenes`` :white:`#########` Path to a text file containing one gene name per line. These gene names refer to the names used in the FASTA file. If this option is used, designs are made only for the genes listed in the text file. The other genes in the FASTA file, not listed in the text file, are still used by Primer3 to check for mispriming.
          | ``-d``, ``--distance`` :white:`#####` *(int)* :white:`###` Minimum number of bases between the gRNA and primer [15].
          | ``-b``, ``--borderLength`` :white:`##.` *(int)* :white:`###` The length of the borders [10]. The borders are used for downstream analysis by SMAP haplotype-window.
          | ``-v``, ``--verbose`` :white:`############.` Verbose, list which target is being processed as the program progresses.
          | ``--version`` :white:`###############..` Show the version. Disregards all other parameters.

    .. tab:: gRNA options

          | ``-g``, ``--gRNAfile`` :white:`##########` *(str)* :white:`###` Path to the gRNA file.
          | ``-gs``, ``--gRNAsource`` :white:`########` *(str)* :white:`###` Program used to generate the gRNAs, either 'CRISPOR', 'FlashFry', or 'other' [FlashFry].
          | ``-ng``, ``--numbergRNAs`` :white:`#######.` *(int)* :white:`###` Maximum number of gRNAs to retain per amplicon [2].
          | ``-go``, ``--gRNAoverlap`` :white:`#######.` *(int)* :white:`###` Minimum number of bases between the start of two adjacent gRNAs [5].
          | ``-t``, ``--threshold`` :white:`##########` *(int)* :white:`###` Minimum gRNA MIT score allowed. gRNAs with a score lower than the threshold are discarded [80].
          | ``-gl``, ``--gRNAlabel`` :white:`################` Label the gRNAs (gRNA1, gRNA2, gRNA3, ...) from left to right instead of from best to worst [the latter is default, which is based on specificity scores].
          | ``-tr5``, ``--targetRegion5`` :white:`#####` *(float)* :white:`###` The fraction of the coding sequence that cannot be targeted by the gRNAs at the 5' end as indicated by a float between 0 and 1 [0.2].
          | ``-tr3``, ``--targetRegion3`` :white:`#####` *(float)* :white:`###` The fraction of the coding sequence that cannot be targeted by the gRNAs at the 3' end as indicated by a float between 0 and 1 [0.2].
          | ``-tsr``, ``--targetSpecificRegion`` :white:`#` *(str)* :white:`###` Only target a specific region in the gene indicated by the feature name in the GFF file. In this case, options ``-tr5`` and ``-tr3`` are ignored.
          | ``-prom``, ``--promoter`` :white:`#########` *(str)* :white:`###` Give the last 6 bases of the promoter that will be used to express the gRNA. This will be taken into account when checking for BsaI or BbsI sites in the gRNA. By default the U6 promoter is used [TGATTG].
          | ``-scaf``, ``--scaffold`` :white:`#########.` *(str)* :white:`###` Give the first 6 bases of the scaffold that will be used. This will be taken into account when checking for BsaI or BbsI sites in the gRNA [GTTTTA].
          | ``-pT``, ``--polyT`` :white:`#############` *(str)* :white:`###` Minimum number of repeated Ts (in a poly-T) in the gRNA to avoid [4].
          | ``-rs``, ``--restrictionSite`` :white:`######` *(str)* :white:`###` Do not filter out gRNAs that contain a BsaI or BbsI restriction site. By default, gRNAs containing a BsaI or BbsI restriction site are filtered out because they interfere in cloning the gRNA with Golden Gate cloning.

    .. tab:: Amplicon options

          | ``-na``, ``--numberAmplicons`` :white:`#############.` *(int)* :white:`##` The maximum number of non-overlapping amplicons per gene in the output [2].
          | ``-minl``, ``--minimumAmpliconLength`` :white:`#######..` *(int)* :white:`##` The minimum length of the amplicons in base pairs [120].
          | ``-maxl``, ``--maximumAmpliconLength`` :white:`#######..` *(int)* :white:`##` The maximum length of the amplicons in base pairs [150].
          | ``-ga``, ``--generateAmplicons`` :white:`############.` *(int)* :white:`##` Number of amplicons to generate per gene by Primer3. The more amplicons are designed by Primer3 the longer the run will be but the more choice there is to select for amplicons. To generate 50 amplicons per 1000 bases per gene enter -1 [150].
          | ``-pmlm``, ``--primerMaxLibraryMispriming`` :white:`###....` *(int)* :white:`##` The maximum allowed weighted similarity of a primer with any sequence in the target gene set (Primer3 setting) [12].
          | ``-ppmlm``, ``--primerPairMaxLibraryMispriming`` :white:`#.` *(int)* :white:`##` The maximum allowed sum of similarities of a primer pair (one similarity for each primer) with any single sequence in the target gene set (Primer3 setting) [24].
          | ``-pmtm``, ``--primerMaxTemplateMispriming`` :white:`###....` *(int)* :white:`##` The maximum allowed similarity of a primer to ectopic sites in the template (Primer3 setting) [12].
          | ``-ppmtm``, ``--primerPairMaxTemplateMispriming`` :white:`#.` *(int)* :white:`##` The maximum allowed summed similarity of both primers to ectopic sites in the template (Primer3 setting) [24].
          | ``-al``, ``--ampliconLabel`` :white:`######################.` Number the amplicons (Amplicon1, Amplicon2, Amplicon3, ...) from left to right instead of from best to worst (which is based on number and specificity scores of the gRNAs it overlaps with).
          | ``-mpa``, ``--misPrimingAllowed`` :white:`##################..` Do not check for mispriming in the gene set when designing primers. By default Primer3 will not allow primers that can prime at other target genes (*i.e.* other genes in the FASTA file).
          | ``-rpd``, ``--restrictPrimerDesign`` :white:`#################` This option will restrict primer design in large introns, increasing the speed of amplicon design, especially useful for genes with large introns such as human genes. 
          | ``-hp``, ``--homopolymer`` :white:`########################` The minimum homopolymer length (number of repeated identical nucleotides) at which an amplicon is discarded, *e.g.* if this parameter is set to 8, amplicons containing a homopolymer of 8 or more As (-...AAAAAAAA...-), Ts, Gs, or Cs will not be used [10].
          | ``-psp``, ``--preSelectedPrimers`` :white:`##################` Give a set of amplicons/primers for which you want to find gRNAs. The primers should be given in a GFF with feature names Primer_forward and Primer_reverse. The forward primer should occur just above the corresponding reverse primer in the GFF file.
          | 
          | See :ref:`Example -rpd <SMAPdesignexrpd>` and :ref:`Example -psp <SMAPdesignexpsp>` for more info.

    .. tab:: Extra output files options

          | ``-smy``, ``--summary`` :white:`######` Write summary file and plot of the output statistics.
          | ``-sf``, ``--SMAPfiles`` :white:`#####` Write three additional files for downstream analysis with SMAP: a BED file with SMAPs for downstream analysis with SMAP haplotype-sites; and a GFF file with borders and a gRNA FASTA file for SMAP haplotype-window.
          | ``-aa``, ``--allAmplicons`` :white:`###` Write additional GFF, primer and gRNA file with all amplicons and their respective gRNAs after filtering per gene.
          | ``-db``, ``--debug`` :white:`########` Write additional GFF file with all amplicons designed by Primer3 and all gRNAs before filtering.

----

Output
------

.. tabs::

   .. tab:: Standard output

        By default, **SMAP design** provides:

       .. tabs::
           .. tab:: standard output

                  | tabular files:

                  * a primer file (TSV file with the gene ID, primer ID and primer sequence).
                  * a gRNA file (TSV file with the gene ID, gRNA ID and gRNA sequence).
                  * a GFF file containing the selected primer and gRNA features (and all other features present in the genome annotation GFF file).
           
           .. tab:: primer file

                  | primer file

                  .. csv-table::
                     :file: ../tables/design/WNK_SMAPdesign_primers.csv
                     :widths: 20, 40, 40

           .. tab:: gRNA file

                  | gRNA file

                  .. csv-table::
                     :file: ../tables/design/WNK_SMAPdesign_gRNAs.csv
                     :widths: 20, 40, 40

           .. tab:: GFF file

                  | GFF file

                  .. csv-table::
                     :file: ../tables/design/WNK_SMAPdesign_gff3.csv


   .. tab:: Optional output

        | Optionally, **SMAP design** also provides:
        
        * a summary file (total number of amplicons designed by Primer3, total number of gRNAs designed, number of gRNAs after filtering).
        * a summary plot showing the number of amplicons and gRNAs that were designed per gene.
        * a SMAP (BED) file that is needed as input for downstream analysis with **SMAP haplotype-sites**.
        * a border (GFF) file that is needed as input for downstream analysis with **SMAP haplotype-window**.
        * a gRNA (FASTA) file that is needed as input for downstream analysis with **SMAP haplotype-window**.
        * a debug file (GFF file) containing all amplicons designed by Primer3 and all gRNAs from the input list before filtering.
        * 'all-amplicons' files (GFF file, a primer file and gRNA file) containing all amplicons with their respective gRNAs (not just the non-overlapping amplicons in the output files).

       .. tabs::
           .. tab:: summary  ``-smy``

               .. tabs::
               
                   .. tab:: summary file

                          | Summary file

                          .. csv-table:: Example of the first rows of the **summary file**
                             :file: ../tables/design/WNK_SMAPdesign_summary.csv
                             :header-rows: 1

                   .. tab:: summary plot (complete design)

                          | Summary plot

                          .. image:: ../images/design/WNK_SMAPdesign_summary_plot.png

                          | This is an example of a summary plot that SMAP design generates for a run of 11 Arabidopsis genes. The box at the top shows general info on the run. In this example, 2 non-overlapping amplicons were requested per gene each with a maximum of 2 gRNAs. In total, SMAP design generated 22 amplicons.

                          | The top left bar graph (Non-overlapping amplicons per gene) shows the number of genes as a function of the number of amplicons. In this case, for all 11 genes 2 amplicons were designed. No genes had 0 or 1 amplicon.

                          | The top right bar graph (gRNAs per gene) shows the number of genes as a function of the number of gRNAs. In this case, for all 11 genes 4 gRNAs were given (2 gRNAs * 2 amplicons per gene).

                          | The middle left graph (gRNAs per amplicon) shows the number of amplicons as a function of the number of gRNAs. In this case, all of the 22 amplicons that were designed covered 2 gRNAs.

                          | The middle right graph shows the underlying reason for not retaining amplicons. Four cases are possible: 1) no gRNAs were designed for the gene or no gRNAs passed the filters; 2) no amplicons were designed by Primer3; 3) the gene is too short to design any amplicons of the desired length; 4) there is no overlap between the gRNAs and amplicons.
                          | In this example, SMAP design was successful for all 11 genes, which is why the graph is empty.

                          | The lower graph (Amplicons with and without gRNAs) shows the number of amplicons that were designed per gene. The grey bar shows the amplicons that were designed by Primer3 that did not overlap with any gRNAs (and are therefore discarded). The green bar shows the number of amplicons designed by Primer3 which did overlap with at least 1 gRNA. By adding both grey and green bars, the total number of amplicons designed by Primer3 per gene can be obtained. In this example a maximum of 150 amplicons was requested. The black points show the length of the gene in basepairs.

                          | This is an example of a perfect design, meaning that for each gene 2 amplicons were designed each covering 2 gRNAs as requested. For more examples see :ref:`Example Data <SMAPdesignex>`.

                   .. tab:: summary plot (partial design)

                          | Summary plot

                          .. image:: ../images/design/SMAPdesign_summaryPlot.png

                          | This is an example of a summary plot that SMAP design generates for a run of 34 human genes. The box at the top shows general info on the run. In this example, 2 non-overlapping amplicons were requested per gene each with a maximum of 2 gRNAs. In total, SMAP design generated 45 amplicons.
                          |
                          | The top left bar graph (Non-overlapping amplicons per gene) shows the number of genes as a function of the number of amplicons. In this case, for **10** genes **no** amplicons were created; for **3** genes, 1 amplicon was created; for **21** genes, 2 amplicons were created.
                          |
                          | The top right bar graph (gRNAs per gene) shows the number of genes as a function of the number of gRNAs. In this case, for **10** genes **no** gRNAs were created; for **3** genes, **2** gRNAs were created; for **21** genes, **4** gRNAs were created (2 gRNAs * 2 amplicons per gene).
                          |
                          | The middle left graph (gRNAs per amplicon) shows the number of amplicons as a function of the number of gRNAs. In this case, all of the **45** amplicons that were designed covered **2** gRNAs.
                          |
                          | The middle right graph shows the underlying reason for not retaining amplicons in the **10** genes for which design failed completely. Four cases are possible: 1) no gRNAs were designed for the gene or no gRNAs passed the filters; 2) no amplicons were designed by Primer3; 3) the gene is too short to design any amplicons of the desired length; 4) there is no overlap between the gRNAs and amplicons.
                          | If SMAP design was successful for all genes, the graph is empty.
                          |
                          | The lower graph (Amplicons with and without gRNAs) shows the number of amplicons that were designed per gene, prior to filtering and gRNA overlap. The grey bar shows the amplicons that were designed by Primer3 that did not overlap with any gRNAs (and are therefore discarded). The green bar shows the number of amplicons designed by Primer3 which did overlap with at least 1 gRNA. By adding the grey and green bars, the total number of amplicons designed by Primer3 per gene can be obtained. In this example a maximum of 150 amplicons per gene was requested (default). The black points show the length of the gene in basepairs.
                          |
                          | This is an example of a 'partial' design where it was not possible to design primers and/or guides for all targeted genes. Additional rounds of design may be attempted to retrieve amplicons for the 10 genes that failed. For more examples see :ref:`Example Data <SMAPdesignex>`.

           .. tab:: SMAPfiles  ``-sf``

               .. tabs::

                   .. tab:: SMAP BED file

                          | SMAP BED file

                          .. csv-table::
                             :delim: tab
                             :file: ../tables/design/WNK_SMAPdesign_SMAPs_bed.tsv
                             :header-rows: 0

                   .. tab:: border GFF file

                          | border GFF file

                          .. csv-table::
                             :file: ../tables/design/WNK_SMAPdesign_borders_gff3.csv
                             :header-rows: 0

                   .. tab:: gRNA FASTA file

                          | gRNA FASTA file

                          .. csv-table::
                             :file: ../tables/design/WNK_SMAPdesign_gRNAs_fasta.csv
                             :header-rows: 0

           .. tab:: allAmplicons  ``-aa``

               .. tabs::

                   .. tab:: allAmplicons border file

                          | allAmplicons border file

                          .. csv-table::
                             :file: ../tables/design/WNK_SMAPdesign_allAmplicons_borders_gff3.csv
                             :header-rows: 0

                   .. tab:: allAmplicons GFF3 file

                          | allAmplicons GFF3 file

                          .. csv-table::
                             :file: ../tables/design/WNK_SMAPdesign_allAmplicons_gff3.csv
                             :header-rows: 0

                   .. tab:: allAmplicons gRNA file

                          | allAmplicons gRNA file

                          .. csv-table::
                             :file: ../tables/design/WNK_SMAPdesign_allAmplicons_gRNAs.csv
                             :header-rows: 0

                   .. tab:: allAmplicons gRNA FASTA file

                          | allAmplicons gRNA FASTA file

                          .. csv-table::
                             :file: ../tables/design/WNK_SMAPdesign_allAmplicons_gRNAs_fasta.csv
                             :header-rows: 0

                   .. tab:: allAmplicons primers file

                          | allAmplicons primers file

                          .. csv-table::
                             :file: ../tables/design/WNK_SMAPdesign_allAmplicons_primers.csv
                             :header-rows: 0

                   .. tab:: allAmplicons SMAP BED file

                          | allAmplicons SMAP BED file

                          .. csv-table::
                             :delim: tab
                             :file: ../tables/design/WNK_SMAPdesign_allAmplicons_SMAPs_bed.tsv
                             :header-rows: 0

           .. tab:: debug  ``-db``

               .. tabs::

                   .. tab:: debug file

                          | debug file

                          .. csv-table::
                             :file: ../tables/design/WNK_SMAPdesign_debug_gff3.csv
                             :header-rows: 0

Example usage
-------------
The instructions below show how to install and run **SMAP design**, and the tools that generate its input files, using the example files in the GitLab repository.

Please check out the :ref:`tutorial <SMAPtutorial_smap_design_potato>` for more details.

To obtain the FASTA and GFF files using **SMAP target-selection** (the output files created by the code below can be found in the samples folder)::

    # install SMAP target-selection
    git clone git@gitlab.com:truttink/smap.git
    cd smap/utilities

    # the input files used here are in the samples directory of the SMAP design repo
    # (to run the exact command shown below, copy those files into the utilities folder)
    # run SMAP target-selection to obtain the FASTA and GFF files of a few gene families
    python3 SMAP_target-selection.py Ath.gff3 ath.con dicot_genefamily_data.hom.csv ath -f Arabidopsis_homology_groups.txt

To obtain the gRNA file *e.g.* for the WNK gene family (HOM04D000265) using FlashFry (output file created by the code below can be found in the samples folder)::

    # install FlashFry
    wget https://github.com/mckennalab/FlashFry/releases/download/1.15/FlashFry-assembly-1.15.jar

    # create the off-target database
    mkdir tmp
    java -Xmx4g -jar FlashFry-assembly-1.15.jar index --tmpLocation ./tmp --database ath_database --reference ath.con --enzyme spcas9ngg

    # discover gRNAs in the reference sequences
    java -Xmx4g -jar FlashFry-assembly-1.15.jar discover --database ath_database --fasta Arabidopsis_WNK_family.fasta --output Arabidopsis_WNK_gRNAs.output

    # score each gRNA
    java -Xmx4g -jar FlashFry-assembly-1.15.jar score --input Arabidopsis_WNK_gRNAs.output --output Arabidopsis_WNK_family_gRNAs_FlashFry.tsv --scoringMetrics doench2014ontarget,doench2016cfd,hsu2013 --database ath_database


Using the output from **SMAP target-selection** and FlashFry, **SMAP design** can be run as follows::

    # install SMAP design
    git clone git@gitlab.com:ilvo/smap-design.git
    pip install primer3-py biopython pandas numpy matplotlib gffutils
    cd smap-design/smap_design

    # run SMAP design to screen for natural variation (i.e. without gRNAs)
    # request a maximum of 100 non-overlapping amplicons per WNK gene, verbose output, a summary file and plot, a SMAP BED file, and a border GFF file
    python3 SMAPdesign.py ../samples/Arabidopsis_WNK_family.fasta ../samples/Arabidopsis_WNK_family.gff -na 100 -v -smy -sf

    # run SMAP design to screen for CRISPR/Cas-induced gene edits (i.e. with gRNAs)
    # request a maximum of 3 non-overlapping amplicons per WNK gene, verbose output, a summary file and plot, a SMAP BED file, a border GFF file, and a gRNA FASTA file
    python3 SMAPdesign.py ../samples/Arabidopsis_WNK_family.fasta ../samples/Arabidopsis_WNK_family.gff -g ../samples/Arabidopsis_WNK_family_gRNAs_FlashFry.tsv -na 3 -v -smy -sf

Command to run **SMAP design** with specified FASTA and GFF file, a gRNA file, output name "MAP3K_SMAPdesign_output", a text file with a selection of genes to perform design on, and a minimum distance between primer and gRNA of 20 bases::

    python3 SMAPdesign.py genes.fasta genes.gff -g gRNAs.tsv -o MAP3K_SMAPdesign_output -sg geneSelection.txt -d 20

Command to run **SMAP design** with a gRNA file from `CRISPOR <http://crispor.tefor.net/>`_, output name "MAP3K_SMAPdesign_output", verbose, maximum 1 gRNA per amplicon, an MIT threshold of 90, targeting the complete gene::

    python3 SMAPdesign.py genes.fasta genes.gff -g gRNAs.tsv -gs CRISPOR --output MAP3K_SMAPdesign_output -v -ng 1 -t 90 -tr5 0 -tr3 0

Command to run **SMAP design** with a gRNA file from neither CRISPOR nor FlashFry (but *e.g.* from CHOPCHOP), 3 gRNAs per amplicon, 2 amplicons per gene, amplicons of length 400 - 800 bp, a primer-gRNA distance of 150 bp, not checking for mispriming between target genes, targeting only the first half of the genes, labeling amplicons and gRNAs from left to right and a minimum distance of 10 bases between adjacent gRNAs::

    python3 SMAPdesign.py genes.fasta genes.gff -g gRNAs.tsv -gs other -ng 3 -na 2 -minl 400 -maxl 800 -d 150 -mpa -tr5 0 -tr3 0.5 -gl -al -go 10
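
In the two commands above, ``-tr5 0 -tr3 0`` targets the complete gene and ``-tr5 0 -tr3 0.5`` targets only the first half, so the two values can be read as the fractions of the sequence excluded at the 5' and 3' end, respectively. The arithmetic can be sketched as below. This is only an illustration under that interpretation: the function ``target_window`` is a hypothetical name that is not part of **SMAP design**, and the sketch ignores strand and exon/intron structure::

    def target_window(length, tr5, tr3):
        """Return (start, end) of the targeted region as 0-based, end-exclusive offsets.

        length: length of the sequence in bases
        tr5: fraction excluded at the 5' end (between 0 and 1)
        tr3: fraction excluded at the 3' end (between 0 and 1)
        """
        if not (0 <= tr5 <= 1 and 0 <= tr3 <= 1):
            raise ValueError("tr5 and tr3 must be between 0 and 1")
        if tr5 + tr3 >= 1:
            raise ValueError("tr5 + tr3 must be below 1, otherwise nothing is targeted")
        start = round(length * tr5)
        end = length - round(length * tr3)
        return start, end


    print(target_window(1200, 0, 0))    # complete sequence: (0, 1200)
    print(target_window(1200, 0, 0.5))  # first half only: (0, 600)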

Command to run **SMAP design** with a gRNA file from FlashFry, only targeting the kinase domains, with an adapted promoter, labeling the gRNAs from left to right, and writing the summary, the SMAP BED, border GFF and gRNA FASTA files, the allAmplicons files, and the debug file::

    python3 SMAPdesign.py genes.fasta genes.gff -g gRNAs.tsv -tsr kinase -prom GTGGCA -gl -smy -sf -aa -db
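
When the same option combinations have to be run repeatedly (for instance once per gene family), it can help to assemble the command line in a small script. The following is a minimal sketch and not part of **SMAP design** itself: the helper ``build_smapdesign_command`` and its arguments are hypothetical names introduced here, and it only uses flags that appear in the examples above. It reproduces the gene-selection command shown earlier in this section::

    import shlex
    import subprocess


    def build_smapdesign_command(fasta, gff, grna_file=None, options=None, flags=()):
        """Assemble a SMAPdesign.py call as an argument list (hypothetical helper)."""
        command = ["python3", "SMAPdesign.py", fasta, gff]
        if grna_file is not None:
            command += ["-g", grna_file]
        # options that take a value, e.g. {"-na": 3, "-d": 20}
        for flag, value in (options or {}).items():
            command += [flag, str(value)]
        # switches without a value, e.g. ("-v", "-smy", "-sf")
        command += list(flags)
        return command


    command = build_smapdesign_command(
        "genes.fasta",
        "genes.gff",
        grna_file="gRNAs.tsv",
        options={"-o": "MAP3K_SMAPdesign_output", "-sg": "geneSelection.txt", "-d": 20},
    )
    print(shlex.join(command))
    # to actually start the design, uncomment the next line
    # subprocess.run(command, check=True)

Passing the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.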