How It Works

Workflow of SMAP haplotype-window

Step 1

SMAP haplotype-window extracts haplotypes from reads aligned to a predefined set of loci, here called Windows, in a reference sequence. Each Window is enclosed by a pair of Border regions.
Border regions can be defined at primer binding sites, either with (Window 1) or without (Window 2) an off-set. Borders can be of variable length, defined by the user (typically 5-10 bp). Pairs of Borders can also be defined so that they enclose Sliding frames, for instance to process Shotgun data.

Step 2

Read mapping is used to assign reads to their corresponding locus on the reference genome.
Consider a sequenced fragment derived from a given genomic locus with a large deletion, or highly polymorphic region with multiple flanking SNPs, in the middle of the fragment.
Two flanking primers can bind the genome sequence of the sample and can amplify the fragment. Also, the two regions flanking the central polymorphism in the same read contain (near-)exact sequence similarity to the reference sequence of the genomic locus.
Mapping reads with BWA-MEM, defines which genomic locus is the origin of the sequenced fragment (the maximal exact match that seeds the alignment), and extends the alignment outwards untill a maximum number of read-reference mismatches is reached.
If read-reference alignments are truncated before the end of the read, BWA-MEM removes the unmapped region of the sequence read in the resulting BAM file (called soft-clipping).
Consequently, the sequence of a given read in the FASTQ file (before read mapping) may have a different length compared to the corresponding read in the BAM file (after mapping).
Also, the polymorphisms that caused the truncation of the read alignment are no longer present in the BAM file (not as alignment, not as FASTQ sequence data), and can not be used to detect polymorphisms by direct read-reference alignment comparison.

Step 3

For each Window, SMAP haplotype-window will define the Window region in the reference genome by pairing Border regions defined in a GFF file.
For each BAM file and for each Window, SMAP haplotype-window will identify the IDs of reads that overlap with at least one nucleotide for a given Window, retrieve their original complete read sequence from the corresponding sample’s FASTQ file and create a separate FASTQ file for each sample-Window combination.
These steps make sure that reads that are soft-clipped during read alignment by BWA-MEM but that initially do contain the Border sequences at their respective ends, can still be evaluated in their entirety. Soft-clipping results in partial read alignment and removal of the unmapped part of the sequence read from the BAM file.
SMAP haplotype-window then retrieves the respective sequences for the upstream Border and downstream Border regions using the GFF coordinates and the reference genome FASTA sequence for each Window.

Step 4

All separate FASTQ files (one for each sample-Window combination) are then passed to Cutadapt using the Window-specific pair of Border sequences for pattern trimming.
Because the Window is defined as the region inbetween the Borders (i.e. read regions retained after removal of the Borders), the entire read sequence spanning the Window is considered as a unique haplotype.

Step 5

These haplotypes are then counted per Window per sample, optionally filtered for (min/max) total read count per Window per sample.
All individual sample specific haplotype count tables are integrated into a large haplotype count matrix.
This procedure detects unique haplotypes in Windows enclosed by two known Border sequences consisting of any (a priori unknown) combination of InDels and/or SNPs, without using the BAM alignment itself for the detection of InDels and/or SNPs. The BWA-MEM alignment is merely used for efficiently sorting reads across the reference genome and grouping by locus.
Haplotype counts are converted into relative frequencies, which can then be filtered to remove low-frequency noise.
The final step of SMAP haplotype-window is only applicable on individuals and converts haplotype frequencies into discrete calls.
Using customizable frequency intervals, haplotype frequencies can either be transformed into dominant calls (0/1) or dosage calls (0/1/2/..).

CRISPR/Cas extension of SMAP haplotype-window

A specific extension of the SMAP haplotype-window workflow for CRISPR data can be invoked using the optional command --guides.

If CRISPR/Cas genome editing was performed by stable transformation with a CRISPR/gRNA delivery vector, then the presence of the gRNA cassette in the delivery vector may be detected in the transformed genome. Primers can be designed on the vector sequence to amplify the gRNA sequence in the gRNA expression cassette, and Border regions can be positioned directly flanking the 20 bp gRNA sequence. The haplotype of that ‘locus’ that is then detected is effectively a copy of the gRNA sequence incorporated into the transformed genome. These primers can be included in the HiPlex primer set used to screen for the genomic target loci. SMAP haplotype-window can assign gRNA vector-derived reads to the respective target loci, if the user provides a FASTA file with the target loci names as identifiers and the 20 bp gRNA as sequence. In this way, genome-edited haplotypes at genomic target loci can be detected in parallel to the gRNAs that cause them, for any number of loci and any number of samples.

../_images/smap_window_sgrna_extraction_crispr.png

Example of gRNA sequences FASTA:

>AT1G07650_1_gRNA_001

TGAAGTCGCAGAACTTAACG

>AT1G07650_1_gRNA_002

CTGAAGTCGCAGAACTTAAC

Example of output file with diverse genome-edited haplotypes at genomic target loci and corresponding gRNA. By sorting on the fourth column (Target) in any output .tsv file, it is possible to arrange all the target loci with their corresponding gRNAs. Note that the standard output of SMAP haplotype-window can be further annotated by SMAP effect-prediction, for instance as the length difference with the reference, or even with the effect of the mutation on the predicted protein, in case candidate genes are used as reference sequence.

Reference

Locus

Haplotypes

Target

Sample1

Sample2

Sample3

Sample4

Sample5

Sample6

Sample7

AT1G07650

AT1G07650_1

GAGCTCTGAAGTCGCAGAACTTAACAGGCATTGTCCCTCCAGAATTCTCTAAGCTTCGTCACCTTAAAGTTTTG

AT1G07650_1

100

100

100

100

100

100

AT1G07650

AT1G07650_2

AAGTGATAATAATTTCACTGGACCAATACCAGATTTCATCAGCAATTGGACAAGAATACTGAAACTGTACGATCTATTCTTT

AT1G07650_2

100

100

100

100

100

100

100

AT1G07650

AT1G07650_3

CTTGTGAAGCTATATGGATGTTGTGTTGAAGGAAACCAATTGATATTGGTGTATGAGTACTTGGAGAACAACTGTCTA

AT1G07650_3

100

100

100

100

100

100

100

AT2G02780

AT2G02780_1

CTTAGCTCTAATTTCATCTCTGGGAAGATTCCAGAAGAGATTGTGTCTTTGAAGAATCTGAAAAGCCTTGTTTTA

AT2G02780_1

100

100

100

100

100

100

100

AT2G02780

AT2G02780_2

CATTGAAGAACAACTCCTTTAGATCCAAGATTCCAGAACAGATCAAGAAGCTGAATAACCTTCAAAGTTTAGACCTTTCT

AT2G02780_2

100

100

100

100

100

100

100

AT2G02780

AT2G02780_3

TTTCTGAGTGTAGGACGTGTACCACAGACAATGAGATCTGCAGTAGATTGGTTTGCCACCGTACCGCGTTTTCTCCTTGGAG

AT2G02780_3

6.22

6.12

AT2G02780

AT2G02780_3

TTTCTGAGTGTAGGACGTGTACCACAGACAATGAGATCTGCAGTGATTGGTTTGCCACCGTACCGCGTTTTCTCCTTGGAG

AT2G02780_3

73.88

54.69

100

100

100

100

100

AT2G02780

AT2G02780_3

TTTCTGAGTGTAGGACGTGTACCACAGACAATGAGATCTGCAGTTGATTGGTTTGCCACCGTACCGCGTTTTCTCCTTGGAG

AT2G02780_3

19.92

39.22

AT2G16250

AT2G16250_1

AGCGTTACCGGGGACTATACCTGAGTGGTTTGGTGTTAGTTTGTTGGCATTGGAAGTATTGGATCTTAGTTCGTGT

AT2G16250_1

100

100

100

100

100

100

100

AT3G03770

AT3G03770_1

ACGCTTATACTCGACGAGAATATGTTTTCTGGTGAGTTACCCGGATTGGATTGATTCTTTGCCGAGTTTAGCTGTGTTGAGCTTGAG

AT3G03770_1

2.22

AT3G03770

AT3G03770_1

ACGCTTATACTCGACGAGAATATGTTTTCTGGTGAGTTACCGAGATTGGATTGATTCTTTGCCGAGTTTAGCTGTGTTGAGCTTGAG

AT3G03770_1

9.46

15.87

AT3G03770

AT3G03770_1

ACGCTTATACTCGACGAGAATATGTTTTCTGGTGAGTTACCGATTGGATTGATTCTTTGCCGAGTTTAGCTGTGTTGAGCTTGAG

AT3G03770_1

10.81

23.62

AT3G03770

AT3G03770_1

ACGCTTATACTCGACGAGAATATGTTTTCTGGTGAGTTACCGGATTGGATTGATTCTTTGCCGAGTTTAGCTGTGTTGAGCTTGAG

AT3G03770_1

72.06

39.84

100

100

100

100

100

AT3G03770

AT3G03770_1

ACGCTTATACTCGACGAGAATATGTTTTCTGGTGAGTTACCGTGATTGGATTGATTCTTTGCCGAGTTTAGCTGTGTTGAGCTTGAG

AT3G03770_1

7.66

18.44

AT3G03770

AT3G03770_2

TGGATCTTTCCTACAATACATTCGTTGGTCCATTTCCTACTTCCTTGATGTCTCTTCCCGCGATAACTTACTTGAACA

AT3G03770_2

100

100

100

100

100

100

100

AT3G03770

AT3G03770_3

TTTTCATTGTTCTTTTCAGATTTATAGAGGGAGATTGAAAGATGGATCCTTTGATGGCAATTAGGTGTCTGAA

AT3G03770_3

7.94

14.58

AT3G03770

AT3G03770_3

TTTTCATTGTTCTTTTCAGATTTATAGAGGGAGATTGAAAGATGGATCCTTTGTGGCAATTAGGTGTCTGAA

AT3G03770_3

42.09

9.28

100

100

100

100

100

AT3G03770

AT3G03770_3

TTTTCATTGTTCTTTTCAGATTTATAGAGGGAGATTGAAAGATGGATCCTTTGTTGGCAATTAGGTGTCTGAA

AT3G03770_3

50

76.19

AT3G08680

AT3G08680_1

TATCCCTCCCGTCCTTTCTCATCGCCTCGTTAATCTTGATCTCTCCGCCAACTCGCTCTCGGGAAACATTCCCACGAGTCTACAA

AT3G08680_1

100

100

100

100

100

100

100

AT3G08680

AT3G08680_2

TCTCCTTCACCGACAACACCAACAGAAGGTCCTGGTACAACCAATATAGGTCGTGGTACCGCCAAAAAAGTTCTCTCCACTGGTGCTATTGT

AT3G08680_2

100

100

100

100

100

100

100

AT3G08680

AT3G08680_3

AGAAGTAGCAGCTGGAAAGAGGGAGTTCGAGCAGCAGATGGAAGCCGTGGGAAGGATCAGTCCACACGTGAAT

AT3G08680_3

1.14

1.3

AT3G08680

AT3G08680_3

AGAAGTAGCAGCTGGGAAACGGGAGTTCGAGCAGCAGATGGAAGCCGTGGGAAGGATCAGTCCACACGTGAAT

AT3G08680_3

100

100

98.88

100

100

100

98.75

AT3G13065

AT3G13065_1

TCTCGTTTTCTGTTTCTGTAGAAAAGTATCTGGACGTGGACTCAGTGGATCTTTAGGTTACCAGCTTGGAAA

AT3G13065_1

100

100

100

100

100

100

100

AT3G24660

AT3G24660_1

GTTTTGCCTCCATCGATTTGGAACCTCTGTGATAAGCTTGTCTCTTTCAAGATTCATGGTAATAACTTGTCTGGGGTTTTGCC

AT3G24660_1

100

100

100

100

100

100

100

AT3G24660

AT3G24660_2

GATTTTGGTGAGTCAAAGTTTGGAGCAGAATCTTTCGAAGGGAACAGTCCTAGCCTTTGTGGTTTGCCTTTGAAGC

AT3G24660_2

100

100

100

100

36.25

100

100

AT3G24660

AT3G24660_2

GATTTTGGTGAGTCAAAGTTTGGAGCAGAATCTTTCGAAGGGAACAGTCCTAGGACTGCATGGTATCTTTGTGGTTTGCCTTTGAAGC

AT3G24660_2

63.81

AT3G51740

AT3G51740_3

GATTCGTCATCAGAATCTTCTTGCACTAAGAGCTTACTACTTAGGACCTAAAGGAGAGAAGCTTCTCGTCTTTGATTACATGCCTAAAAGC

AT3G51740_3

5.02

3.42

4.57

4.63

6.12

3.97

6.25

AT3G51740

AT3G51740_3

GATTCGTCATCAGAATCTTCTTGCACTAAGAGCTTACTACTTAGGACCTAAAGGAGAGAAGCTTCTCGTCTTTGATTACATGTCTAAAGGA

AT3G51740_3

95.06

96.56

95.44

95.38

93.81

88.06

88.38

AT3G51740

AT3G51740_3

GATTCGTCATCAGAATCTTCTTGCACTAAGAGCTTACTACTTAGGACCTAAAGGGAGAGAAGCTTCTCGTCTTTGATTACATGTCTAAAGGA

AT3G51740_3

7.94

5.36

AT5G20690

AT5G20690_2

TCCTCGGTACCAGAAACCTCGAAACAAGGCCGCAATAAACGCAATCATGGTGTCGATATCGCTTCTTCTCCTGTTCTTTATT

AT5G20690_2

25.41

25.73

AT5G20690

AT5G20690_2

TCCTCGGTACCAGAAACCTCGAACAAGGCCGCAATAAACGCAATCATGGTGTCGATATCGCTTCTTCTCCTGTTCTTTATT

AT5G20690_2

100

100

74.62

74.25

100

100

100

CRISPR_Vector

gRNA

ACTCATACACCAATATCAAT

AT1G07650_3_gRNA_006

22.55

24.22

CRISPR_Vector

gRNA

ACAATGAGATCTGCAGTGAT

AT2G02780_3_gRNA_077

45.09

46.81

CRISPR_Vector

gRNA

TTTGTTGGCATTGGAAGTAT

AT2G16250_1_gRNA_080

61.53

CRISPR_Vector

gRNA

TTTCTGGTGAGTTACCGGAT

AT3G03770_1_gRNA_109

54.88

53.19

CRISPR_Vector

gRNA

AAAGGCAAACCACAAAGGCT

AT3G24660_2_gRNA_142

38.5

CRISPR_Vector

gRNA

ACGAGAAGCTTCTCTCCTTT

AT3G51740_3_gRNA_150

29.88

30.45

CRISPR_Vector

gRNA

GGTACCAGAAACCTCGAACA

AT5G20690_2_gRNA_213

23

21.62

CRISPR_Vector

gRNA

TAGAAGACGTTGACCATGCT

gRNA_1

40.25

39.34