Individuals

Step 3: Filtering haplotypes and calculating haplotype frequencies

procedure

After processing all BAM and FASTQ files in the given directories and storing all observed haplotypes in a table for each sample, SMAP haplotype-window loops over all tables to create an integrated table that contains all observed haplotypes across all BAM files and collects their absolute read counts. The next step is to switch from absolute read depth per haplotype to relative read depth per haplotype per Window, i.e. to calculate haplotype frequencies. The haplotype frequency is calculated per haplotype per locus per sample (haplotype count per sample/total count per window per sample; range 0-100%).
Then each haplotype’s length is compared with the referent Window sequence’s length (haplotype length - ref Window length) and written into the column 'Length difference with reference’ or LDR. For each haplotype with an LDR = 0, the algorithm will look to see if it is an exact match with the reference Window sequence, if so, the LDR receives the value Ref instead of 0.
../../../_images/SMAP_window_step4b.png

filters

read errors create false positive haplotypes

Randomly distributed read errors erroneously create low frequency haplotypes. Example data are shown in the scheme above, see tables below for the occurence in real HiPlex data of 8 diploid pools.

false positive haplotypes are removed from the data set with a minimal frequency threshold (option -f)

Haplotypes that never reach a user-defined minimal HF threshold (default 0%) in at least one of the samples, are removed entirely from the haplotype count table. After this filter, the haplotype frequency is recalculated on the remaining read counts per haplotype per Window per sample. The effect of adjusting this filter is illustrated in real data in the tables below. Please compare the tables below at subsequent steps of filtering: -f 0 filtering is effectively no filter (all haplotype observations > 0% are kept). Most noise from low frequency haplotypes can effectively be removed with -f 1 or -f 5 . We recommend to test the effect of this parameter on your own data, however in individuals it is best to have a high -f value, as alleles/haplotypes are expected to be present in around 25% and 50% of the reads in tetraploids and diploids respectively.

Samples with false positive haplotypes are masked for those haplotypes in the data set with a minimal frequency threshold (option -m)

Haplotypes that reach the user defined -f value in at least one BAM/FASTQ file are retained, however other BAM/FASTQ files that have a HF lower than the -f value for a haplotype where at least one sample exceeded the -f are also retained. The option -m masks all HF lower than set value by substituting them with NaN’s or another value defined by --undefined_representation. The other haplotype frequencies are not recalculated.

The following tabs represent data from a real experiment. Three CRISPR/Cas9 targetted loci are shown. The number of haplotypes in -f 0 filtering is overwhelming, showcasing the importance of haplotype frequency filtering

Haplotype frequency distributions

The different tabs below show the typical haplotype frequency distributions of diploid or tetraploid individuals for HiPlex data. Vertical lines in the haplotype frequency distributions show the thresholds by which haplotype frequency is transformed into discrete call classes. The commands to run SMAP haplotype-window on these datatypes are shown below each graph.

# .. image:: ../../../images/window/tobemade

smap haplotype-window -alignments_dir /path/to/BAM/ -genome /path/to/RefGenome -borders /path/to/GFF -reads_dir /path/to/FASTQ -min_read_count 30 -f 5 -p 8 --min_distinct_haplotypes 2 --discrete_calls dominant --frequency_interval_bounds 10

Step 4: Transforming haplotype frequencies to discrete calls in individuals

procedure

If individual genotypes are analysed in, the final step is to transform observed haplotype frequencies per individual back to discrete haplotype calls using --discrete_calls. SMAP haplotype-window uses simple, user-defined haplotype frequency thresholds to define discrete genotypic classes. The multi-allelic nature of haplotype calling is retained, and the final genotype call table lists the absence/presence (0/1) or dosage (0/1/2 diploids; 0/1/2/3/4 tetraploids) of each haplotype per individual.

Next, the total count of discrete haplotypes per Locus per sample is calculated and output as a table. See examples below.

Alternatively, haplotype read depth data generated with SMAP haplotype-window may be used for genotype calling in individuals using statistical methods, for instance Clark et al. 2019.

../../../_images/SMAP_window_step6.png

filters

read errors create false positive haplotypes

Randomly distributed read errors erroneously create low frequency haplotypes. Example data are shown in the scheme above, see tables below for the occurence in real HiPlex data of 8 diploid pools.

false positive haplotypes are removed from the data set with a minimal frequency threshold (option -f)

Haplotypes that never reach a user-defined minimal HF threshold (default 0%) in at least one of the BAM files, are removed entirely from the haplotype count table. After this filter, the haplotype frequency is recalculated on the remaining read counts per haplotype per Window per BAM file. The effect of adjusting this filter is illustrated in real data in the tables below. Please compare the tables below at subsequent steps of filtering: -f 0 filtering is effectively no filter (all haplotype observations > 0% are kept). Most noise from low frequency haplotypes can effectively be removed with -f 1 or -f 5 . We recommend to test the effect of this parameter on your own data, however in individuals it is best to have a high -f value, as alleles/haplotypes are expected to be present in around 25% and 50% of the reads in tetraploids and diploids respectively.

Samples with false positive haplotypes are masked for those haplotypes in the data set with a minimal frequency threshold (option -m)

Haplotypes that reach the user defined -f value in at least one BAM/FASTQ file are retained, however other BAM/FASTQ files that have a HF lower than the -f value for a haplotype where at least one sample exceeded the -f are also retained. The option -m masks all HF lower than set value by substituting them with NaN’s or another value defined by --undefined_representation. The other haplotype frequencies are not recalculated.

Haplotype count tables

If SMAP haplotype-window is run using --discrete_calls the analysis continues by creating discrete haplotype calls per individual sample. For each sample and for each Window, haplotype frequencies are transformed to discrete calls using simple user-defined frequency thresholds based on the observed haplotype frequency spectrum.

Next, the total count of discrete haplotypes per Window per sample is calculated and output as table. See examples below.

Per diploid sample, windows with a total haplotype count different different from a set value (--dosage_filter) are removed (set to `NA´ ). The recommended value for this filter is 2 for diploids and 4 for tetraploids. The haplotype frequency (HF) is then calculated across the set of samples (count per haplotype/total haplotype count per Window * 100%). This measure identifies the haplotypes that are supported by sufficient read depth in individual genotypes, but rare across the sample set (e.g. population).

Haplotype call tables

Haplotype allele frequencies are calculated as the number of observations of a haplotype divided by the total number of haplotype observations (ploidy (or --dosage_filter value) x number of samples with observations) on that locus. In -f 0 filtering, all haplotypes are retained. It is recommended to try out different values and decide which value suits your data best.


Output

Tabular output

By default, SMAP haplotype-window will return two .tsv files.
File counts_cx_fx_mx.tsv (with x the value per option used in the analysis) contains the read depth (-c) and haplotype frequency (-f) filtered counts per haplotype per Window as defined in the GFF file.
This is the file structure:

Locus

Haplotypes

LDR

Sample1

Sample2

Sample..

Window_1

ACGTCGTCGC

ref

60

13

34

Window_1

ACGTCGTCAC

0

19

90

51

Window_2

GCTCATCG

ref

70

63

87

Window_2

GCTCTCG

-1

108

22

134

File haplotypes_cx_fx_mx.tsv contains the relative frequency per haplotype per locus in each BAM file (based on the corresponding count table: counts_cx_fx_mx.tsv, with x the value per option used in the analysis).
This is the file structure:

Locus

Haplotypes

LDR

Sample1

Sample2

Sample..

Window_1

ACGTCGTCGC

ref

0.76

0.13

0.40

Window_1

ACGTCGTCAC

0

0.24

0.87

0.60

Window_2

GCTCATCG

ref

0.39

0.74

0.39

Window_2

GCTCTCG

-1

0.61

0.26

0.61

Additionally freqs_unfiltered.tsv can be further filtered using the options -j (minimum distinct haplotypes) and -k (maximum distinct haplotypes), resulting in the file freqs_distinct_haplotypes_filter.tsv

Code

-genome ################### (str) ### FASTA file with the reference genome sequence.
–borders ################## (str) ### GFF file with the coordinates of pairs of Borders that enclose a Window. Must contain NAME=<> in column 9 to denote the Window name.
–reads_dir ################# (str) ### Path to the directory containing FASTQ files with the reads mapped to the reference genome to create the BAM files. The FASTQ file names must have the same prefix as the BAM files specified in -alignments_dir [no default].
-alignments_dir ############# (str) ### Path to the directory containing BAM and BAI alignment files. All BAM files should be in the same directory [no default].
-–guides ################## (str) ### Optional FASTA file containing the sequences from sgRNAs used in CRISPR-Cas9 genome editing. Useful when amplicons on the CRISPR-Cas9/sgRNA delivery vector are included in the HiPlex amplicon mixture.
-p, --processes ############ (int) ### Number of parallel processes [1].
-o, --out ################ (str) ### Basename of the output file without extension [SMAP_haplotype_window].
-u, --undefined_representation # (str) ### Value to use for non-existing or masked data [NaN].
-h, --help ###################### Show the full list of options. Disregards all other parameters.
-v, --version #################### Show the version. Disregards all other parameters.
--debug ######################### Enable verbose logging. Provides additional intermediate output-files used for sample-specific QC.

Options may be given in any order.