Scope & Usage

Scope

SMAP haplotype-window extracts haplotypes from reads aligned to a predefined set of Windows in a reference sequence, wherein each Window is enclosed by a pair of Border regions.
SMAP haplotype-window can be used for highly multiplex amplicon sequencing (HiPlex) or Shotgun sequencing data.
SMAP haplotype-window extracts an entire DNA sequence between two Borders as a haplotype allele, without prior knowledge of polymorphisms within that sequence, and considers any unique DNA sequence as a haplotype. This is different from SMAP haplotype-sites, which performs read-backed haplotyping using prior positional information of read alignments and creates multi-allelic haplotypes from a concatenated short string of polymorphic sites (ShortHaps).
../_images/SMAP_haplotype_window_SQS_short_v2.png
In the SMAP haplotype-window workflow, the user first selects Windows (loci to be haplotyped) enclosed by pairs of Border regions. Then, for each BAM file and each Window, SMAP haplotype-window extracts the ID’s of reads that overlap with the respective Windows with at least one nucleotide. Using the list of read-IDs, a new temporary FASTQ file is created for each sample-Window combination. Then, for each sample-Window FASTQ file, the corresponding Border sequences are used for pattern-match read trimming with Cutadapt. The remaining read sequences per Window are considered haplotypes, which are counted and listed in an integrated haplotype table per sample and per Window.

SMAP haplotype-window considers the entire read sequence spanning the region between the Borders as haplotype.
SMAP haplotype-window filters out genotype calls of Windows with low read depth and low frequency haplotypes to control for noise in the data.
SMAP haplotype-window creates a multi-allelic haplotype table listing haplotype counts and frequencies per Window, per sample, across the sample set.
SMAP haplotype-window plots the haplotype frequency distribution across all Windows per sample, and the distribution of haplotype diversity (number of distinct haplotypes per locus) across the sample set.
SMAP haplotype-window can transform the haplotype frequency table into multi-allelic discrete genotype calls.

Integration in the SMAP workflow

../_images/SMAP_global_scheme_home_window.png

SMAP haplotype-window requires this input:

  1. a FASTA file with the reference sequence. Typically, whole genome reference sequences are used for Shotgun sequencing data, while a reference consisting of selected candidate genes may be created by SMAP target-selection for HiPlex data.

  2. a GFF file with the coordinates of pairs of borders that enclose a window to define the locus positions, created with SMAP sliding-frames for Shotgun data or SMAP design for HiPlex data.

  3. a set of FASTQ files with preprocessed reads that need to be haplotyped. Any number of samples may be given and will be processed in parallel.

  4. a set of BAM files made with BWA-MEM using the respective reference sequence and FASTQ files.

  5. optional: a FASTA file containing the gRNA sequences, created by SMAP design, in case CRISPR was performed by stable transformation with a CRISPR/gRNA delivery vector, see also CRISPR.

Genotype call tables created by SMAP haplotype-window may further be analysed with SMAP grm (for HiPlex and Shotgun data), or with SMAP effect-prediction (for HiPlex data).

Commands & options

The program has four mandatory positional arguments and multiple optional arguments.

Mandatory options for SMAP haplotype-window:

genome ################### (str) ### FASTA file with the reference genome sequence. The first positional mandatory argument after smap haplotype-window [no default]. borders ################## (str) ### GFF file with the coordinates of pairs of Borders that enclose a Window. Must contain NAME=<> in column 9 to denote the Window name. The second positional mandatory argument after smap haplotype-window [no default]. alignments_dir ############# (str) ### Path to the directory containing BAM and BAM index (BAI) files. All BAM files should be in the same directory. The third positional mandatory argument after smap haplotype-window. Should be indicated by absolute path or at least a “.” [default current directory]. sample_dir ################ (str) ### Path to the directory containing FASTQ files with the reads mapped to the reference genome to create the BAM files. The FASTQ file names must have the same prefix as the BAM files specified in alignments_dir. The fourth positional mandatory argument after smap haplotype-window [no default].

optional arguments:

--write_sorted_sequences ############# Write FASTQ files containing the reads for each Window in a separate file per input sample [default off].
-p, --processes ############ (int) ### Number of parallel processes [1].
--memory_efficient ################# Reduces the memory load significantly, but increases time to calculate results.
-o, --out ################ (str) ### Basename of the output file without extension [“”].
-h, --help ###################### Show the full list of options. Disregards all other parameters.
-v, --version #################### Show the version. Disregards all other parameters.
--debug ######################### Enable verbose logging. Provides additional intermediate output files used for sample-specific QC.

Optional arguments may be given in any order.

Example commands

Pools

smap haplotype-window /path/to/RefGenome.fasta/ /path/to/borders.gff/ /path/to/BAM_directory/ /path/to/FASTQ_directory/ --min_read_count 30 -f 2 -m 1 -u "" -p 8 --min_distinct_haplotypes 2

Individuals

smap haplotype-window /path/to/RefGenome.fasta/ /path/to/borders.gff/ /path/to/BAM_directory/ /path/to/FASTQ_directory/ --min_read_count 10 --discrete_calls dominant --frequency_interval_bounds 10 -f 5 -u "" -p 8 --min_distinct_haplotypes 2

Output

Tabular output

By default, SMAP haplotype-window will return two .tsv files.

haplotype counts

counts_cx_fx_mx.tsv (with x the value per option used in the analysis) contains the read counts (-c) and haplotype frequency (-f) filtered and/or masked (-m) read counts per haplotype per locus as defined in the BED file from SMAP delineate. This is the file structure:

Reference

Locus

Haplotypes

Sample1

Sample2

Sample..

Chr1

Window_1

ACGTCGTCGC

60

13

34

Chr1

Window_1

ACGTCGTCAC

19

90

51

Chr1

Window_2

GCTCATCG

70

63

87

Chr1

Window_2

GCTCTCG

108

22

134

relative haplotype frequency

haplotypes_cx_fx_mx.tsv contains the relative frequency per haplotype per locus in sample (based on the corresponding count table: counts_cx_fx_mx.tsv). The transformation to relative frequency per locus-sample combination inherently normalizes for differences in total number of mapped reads across samples, and differences in amplification efficiency across loci. This is the file structure:

Reference

Locus

Haplotypes

Sample1

Sample2

Sample..

Chr1

Window_1

ACGTCGTCGC

0.76

0.13

0.40

Chr1

Window_1

ACGTCGTCAC

0.24

0.87

0.60

Chr1

Window_2

GCTCATCG

0.39

0.74

0.39

Chr1

Window_2

GCTCTCG

0.61

0.26

0.61

Additionally freqs_unfiltered.tsv can be further filtered using the options -j (minimum distinct haplotypes) and -k (maximum distinct haplotypes), resulting in the file freqs_distinct_haplotypes_filter.tsv