How It Works

Here, we describe the technical implementation of SMAP effect-prediction.


Preparing input files with SMAP design and SMAP haplotype-window

SMAP design

During the SMAP design workflow, SMAP target-selection first selects one or more genes with a short flanking upstream and downstream region, and a GFF file is created with the positional information of the exons that together make up the protein coding sequence (CDS). Importantly, SMAP target-selection makes sure that all sequences are oriented in the direction of the protein coding gene. Genes encoded on the - strand in the reference genome sequence are reverse complemented, and all coordinates of the CDS are automatically reversed. If custom created FASTA and GFF files are provided to SMAP effect-prediction, and CDS features are positioned on the - strand in the GFF file with the structural gene annotation, SMAP effect-prediction will indicate which genes have errors, and should be reverse completented prior to running SMAP haplotype-window and SMAP effect-prediction. If the CDS of a particular gene does not begin with a START codon (ATG), or end with a STOP codon (TAA, TAG or TGA), SMAP effect-prediction will indicate which genes have errors that should be corrected in the reference sequence FASTA file or the structural gene annotation GFF file.

../_images/HIW_gene.png

Then, SMAP design identifies amplicons and gRNAs in the CDS of the gene. One or more amplicons may be designed per gene, each containing one or more gRNAs. A minimal spacing between the gRNA(s) and the primer binding sites are respected to allow for editing at some distance from the PAM site without affecting the primer binding sites, thus retaining the capacity to amplify the genomic region containing induced mutations.

The coordinates of gRNAs and primers with respect to the selected reference gene sequence are listed in GFF files for downstream analysis. Border sequences are defined as the last 10 nucleotides at the 3’ end of the forward and reverse primers for delineation of haplotype sequences by SMAP haplotype-window.

../_images/HIW_design.png

The sequences of the gRNAs are used to synthesize gRNAs and clone into expression vectors to drive CRISPR/Cas in plant materials. The primers are used for HiPlex sequencing of potentially edited lines.

SMAP haplotype-window

SMAP haplotype-window analyses reads aligned to the target loci and extracts the exact DNA sequence inbetween the upstream and downstream border sequences per locus, thus listing all observed haplotypes per locus across a collection of potential mutants. SMAP haplotype-window also calculates relative haplotype frequency per locus per sample. This genotyping table is the input for SMAP effect-prediction.

../_images/HIW_window.png

The subsequent steps of SMAP effect-prediction

Next, we describe the seven subsequent steps of SMAP effect-prediction.

Step 1. Collect

Collect all positional and sequence information needed to predict the encoded protein for each haplotype

SMAP effect-prediction collects the following information from files prepared by the other modules of SMAP:

  1. the gene sequence from the reference genome; SMAP target-selection extracts the gene sequence and places it with the CDS on the + strand orientation in the reference FASTA file used for SMAP.

  2. the position of the CDS regions in the reference sequence; SMAP target-selection calculates the correct positions of the CDS with respect to the extracted gene reference sequence of 1).

  3. the position of the amplicon(s) in the gene reference sequence; SMAP design creates pairs of primers for HiPlex sequencing of genomic DNA, and stores the relative position of the corresponding border regions in a GFF file.

  4. the position of gRNA(s) for CRISPR/Cas genome editing within an amplicon; SMAP design optionally creates one or more gRNAs per amplicon to induce mutations in a particular position of the reference genome.

  5. the collection of haplotypes per locus and their relative frequencies per sample; SMAP haplotype-window extracts haplotypes (exact DNA sequences) using the exact same reference gene coordinates as outlined in 1)-4).

Naturally occurring sequence variation

If SMAP effect-prediction is used to analyse naturally occurring sequence variation present in a broad genepool (e.g. ecotypes or breeding materials), it derives the following information per gene and amplicon:

  1. reference protein sequence (including checks for translational START and STOP codons in the reference protein).

  2. position of intron/exon junctions and splicing donor/acceptor sites.

CRISPR/Cas-induced mutations

If SMAP effect-prediction is used to analyse CRISPR/Cas-induced genome editing, positional information of the gRNA is used to also derive:

  1. gRNA sequence.

  2. expected cut-site.

  3. region of interest (ROI).

The information required (and their respective options) are illustrated in the tabs below. The difference between analysing these types of sequence variation is defined by the region of interest (ROI) in which sequence variation is expected:

../_images/HIW_collect_ROI_Nat_Var.png

In case of natural variation, the entire length of the haplotype window is considered.


Step 2. Align

Align each haplotype to its reference per locus

../_images/align_hap_a.png

Each alternative haplotype is aligned to it’s reference haplotype sequence of the corresponding locus. The reference sequence is retrieved from the coordinates of the borders of that locus provided as GFF file, and the reference sequence FASTA file. Here, the alignments of locus gene3_1 of the example data set are shown. Insertions in the alternative haplotype are shown in green, SNPs are shown in red.


Step 3. Filter and collapse

Filter for haplotypes based on location of sequence variants within the window: region of interest (ROI)

SMAP haplotype-window extracts all observed unique DNA sequences within a user-defined window and calls these haplotypes. So, artefactual haplotypes may be created by sequence variation such as read errors, imprecise trimming, etc. SMAP haplotype-window allows to filter the genotype call table based on haplotype frequency, but not on sequence content. To further eliminate artefactual haplotypes and collapse the genotype table into a simpler matrix with less haplotype complexity, SMAP effect-prediction implements an optional filter that only retains sequence variants that overlap with a given ROI.

3.1. Definition of the ROI.

../_images/HIW_collect_ROI_Nat_Var.png

Because there is no prior focus or knowledge on where naturally occurring sequence variants may be located, the ROI typically spans the entire length of the locus.

3.2. Filtering on the ROI.

../_images/HIW_collect_ROI_Nat_Var.png

Since there is no prior focus on where naturally occurring sequence variants may be located, the ROI typically spans the entire length of the locus.

3.3. Collapse the haplotype table based on identical haplotype calls in the region of interest (ROI).

After exclusion of sequence variants outside the ROI, some haplotypes may contain identical haplotype calls within the ROI.

../_images/align_haplotype_name_new_v2.png

If the user decides to ignore sequence variation outside the ROI, the algorithm ‘collapses’ corresponding haplotypes by summing their relative frequencies in the haplotype table. Accordingly, the haplotype frequency table switches to new haplotype identifiers per locus, because the exact DNA sequence that initially identified the unique haplotypes no longer correspond to the ‘collapsed’ sequences.

../_images/haplotype_collapse.png

Step 4. Annotate

Annotate the haplotype, score effects on gene structure and predicted protein sequence.

SMAP effect-prediction uses the local GFF file provided by the user to extract gene structure annotation and place the haplotype back into its genomic context. For each haplotype, the corresponding full length mutated protein sequence is reconstructed and mutations in strategic sites (START codon, splicing sites, and STOP codon) are searched. At the end of the process, the master table is extended with more columns further describing the computationally predicted effect of each haplotype on the protein. The steps to annotate the master table are as follows:

4.1. Delineate the CDS region in the window by creating a reference ‘CDS code’.

Using the local GFF file provided by the user, SMAP effect-prediction identifies the genomic areas corresponding to coding sequences (CDS). The CDS regions are encoded as 1 and the non-coding areas as 0. The reference genome sequence is thereby simplified into a string of 0’s and 1’s, here named the “CDS code”. The file with border positions is used to identify the position of the haplotype both in the genome sequence and in the CDS code. The reference and mutated haplotype are extracted from the master table and the haplotype CDS code is extracted from the CDS code using the border positions. Regions outside the borders in the genomic and CDS code sequences are kept for later stages.

../_images/code_reference_definition.png

4.2. Identify indels and modify the ‘CDS code’.

The alignment between the reference and mutated haplotypes (see Step 2 Align) is used to identify indels. The alignment is screened for gaps; insertions are represented by gaps in the aligned reference haplotype whereas deletions are shown by gaps in the aligned mutated haplotype. Indel indices are captured, and the reference haplotype CDS code is modified accordingly, so that its indexing now matches the length of the alignment and fits the mutated haplotype CDS sequence.

../_images/INDELS_gaps.png

Insertions and deletions in the haplotype CDS code are treated differently.

Insertions

For instance, if a +5 bp insertion is observed between positions 43 and 44 in the reference sequence, the haplotype CDS code will be extended by 5 characters at the same index. The mutated haplotype has a longer length compared to the reference haplotype, and so the mutated haplotype CDS code must be extended too. The rules to know which character (either 0 or 1) has to be used for extending the haplotype CDS code are as follows:

  • If positions surrounding the insertion are coding in the haplotype CDS code (i.e. 1), then the insertion is considered as coding (i.e. made of extra 1’s).

  • If positions surrounding the insertion are non-coding in the haplotype CDS code (i.e. 0), then the insertion is considered as non-coding as well (i.e. made of extra 0’s)

../_images/code_insertion_intron.png ../_images/code_insertion_exon.png
  • If one of the positions surrounding the insertion is coding (1), and the other is non-coding (0), this means that the mutation occurs right after or right before a splicing site. In both cases, the insertion is considered as coding.

../_images/code_insertion_up.png ../_images/code_insertion_down.png

Deletions

../_images/code_deletion_exon.png ../_images/code_deletion_across.png

In the case of a deletion, the indices of the deletion in the reference haplotype are simply removed from the haplotype CDS code. The mutated haplotype has a shorter length compared to the reference haplotype, and so does the mutated haplotype CDS code.

SNPs

../_images/code_SNP.png

In case of SNPs, no gaps are observed in the alignment (only mismatches). The indexing is the same between the non-aligned and aligned reference and mutated sequences so the haplotype CDS code is not modified. SNPs in coding areas are assumed to be coding and SNPs in non-coding areas are assumed to be non-coding.

4.3. Place CDS codes of reference and mutated haplotypes back into their genome context.

The mutated haplotype sequence and the mutated CDS code are placed back into their respective context. Both sequences are stitched back with the reference regions outside the borders that were kept at step 1 of the annotation process. This results in a full genome sequence with mutation in the haplotype area and a full CDS code of the same length with altered sequence in the haplotype area.

4.4. Search for mutations at translational START and STOP codons and splicing sites.

Before translating the mutated genomic sequence using the mutated CDS code, strategic sites are searched for mutations. SMAP effect-prediction will consider that any modifications at the translational START codon is a major effect. Because SMAP effect-prediction can not reliably predict translation re-initiation that might occur at a downstream alternative translational START codon, it is not possible to compute an alternative protein sequence. The resulting identity score between the reference and the mutated protein is by definition 0. Likewise, modifications at splicing sites are considered major effects. Because SMAP effect-prediction cannot reliably predict which downstream splicing donor or acceptor site will be used, the algorithm simply truncates protein translation right at the position of the mutated splicing donor or acceptor site. Finally, mutations at the translational STOP codon lead to an extended open reading frame (ORF) at the 3’ end, and the translation continues until it reaches the following STOP codon in the ORF.

4.5. Extract and stitch all CDS sequences to create a full length CDS.

The mutated protein is obtained using the mutated genomic sequence and the mutated CDS code. Coding areas are extracted and stitched together to form the full CDS. The mutated CDS is then translated, considering possible mutations at strategic sites (see step 4 of the annotation process).

../_images/Adjusted_ORF_haplotype_name_V4_zoom_exon4_translation.png

In-silico translation of the haplotypes of gene3. Amplicons (gene3_1) were projected into their respective gene context and translated in the corresponding ORF.

4.6. Align reference and mutated protein and calculate %identity score.

The reference and mutated protein are aligned. The number of identical amino acids in the alignment is computed and divided by the total length of the alignment and expressed as percentage to obtain the identity score (see also Grant lab, Girgis et al., 2021, EBI).

ref ML--IIFGLA
    ||  ||| ||
mut MLDKIIF-LA
Number of identical aligned residues: 7
Total length of the alignment: 10
%identity score: 7/10 = 0.70 = 70%

In case the original protein is much longer than the mutated protein, the two proteins are aligned globally so the length of the alignment is 16, but only 7 amino acids are identical in the alignment.

ref ML--IIFGLATLGHWS*
    ||  ||| ||
mut MLDKIIF-LA*
Number of identical aligned residues: 7
Total length of the original protein: 16
%identity score: 7/16 = 0.438 = 43.8%
../_images/Adjusted_ORF_haplotype_name_V4_translations_alignment.png

Alignment of the predicted proteins encoded by haplotypes of gene3.

4.7. Add the novel annotation columns to the haplotype frequency table.

The master table is annotated and extended with five more columns:

  1. atgCheck: whether the ROI in the haplotype contains a mutation affecting the START codon: True/False

  2. splicingSiteCheck: whether the ROI in the haplotype contains a mutation affecting a splicing site: True/False

  3. stopCodonCheck: whether the ROI in the haplotype contains a mutation affecting the STOP codon: True/False

  4. protein_sequence: the full length mutated protein sequence

  5. pairwiseProteinIdentity: the %identity score between the reference and mutated proteins as explained at 6.


Step 5. Classify loss-of-function effect of mutation on protein per haplotype: LOF effect classes

quantify the effect on gene function or activity based on the %identity score

The user can define the %identity cutoff to declare that a mutation has an effect on the protein function or activity, based on the protein identity score computed in the previous step (see step 4.6. of the annotation process). The degree of loss-of-function (LOF) can be discretized in three discrete effect classes: no or minimal effect, intermediate effect, strong effect (knockout, KO). Haplotypes leading to substantial loss of the protein sequence are expected to cause major protein disruptions and are therefore considered as loss-of-function (LOF) or knockout (KO) haplotypes. For instance, at a cutoff of 70%, haplotypes with %identity score below 70 are considered major effect mutations, whereas haplotypes with %identity score greater than 70%, are considered to not have a major effect on the protein (i.e considered as functional as reference).

../_images/haplotype_LOF_class.png

Step 6. Aggregate haplotype frequencies per locus by LOF effect class

Aggregate the frequency values of haplotypes that have at least a minimal degree of LOF: per locus, per sample

../_images/haplotype_LOF_frequency_table.png

The frequencies of LOF haplotypes are summed per locus per sample to display the relative fraction of proteins with at least a minimal degree of LOF. The resulting aggregated genotype call table contains the cumulative frequency of all haplotypes encoding a LOF protein (one value per locus per sample on a scale of 0-1; where 0 indicates all reference protein, and 1 indicates all LOF protein).


Step 7. Discretize

Transform cumulative LOF frequency to discrete genotype calls: WT, heterozygous KO, homozygous KO

../_images/Discretize_LOF_frequencies_all9genes_newFreq.png

Aggregation of KO haplotype frequencies leads to a quantitative LOF frequency distribution that is difficult to interpret in genetic analysis. The aggregated KO haplotype frequency has a W-shape distributions in diploid samples. Local maxima are located around values of aggregated KO haplotype frequency of 0, 50%, and 100% which corresponds to homozygous reference, heterozygous KO, and homozygous KO, respectively. Indeed situations where half of the reads are KO haplotypes and half are not highlight heterozygosity in a diploid organism. SMAP effect-prediction can transform the quantitative KO haplotype frequency into discrete genotype calls, homozygous reference, heterozygous, homozygous mutant, coded as 0, 1, and 2, respectively. The frequency intervals to call such genotypes can be user-defined. As a rule of thumb, we generally consider that <15% is homozygous reference, between 40% and 60% is heterozygous, and >90% is homozygous mutant. These cutoffs can be adjusted after inspection of the graphical output after a first analysis. The discretization further eases the interpretation of the table, especially in the case of Mendelian segregation in progeny, or downstream statistical analyses to associate phenotypes to genotypes. Discretization may also be performed before aggregation of haplotypes per locus so that one can associate phenotype per haplotype rather than per locus.

../_images/frequency_distribution.png