Scope & Usage

Scope

SMAP haplotype-window extracts haplotypes from reads aligned to a predefined set of Windows in a reference sequence, wherein each Window is enclosed by a pair of Border regions.
SMAP haplotype-window can be used for highly multiplex amplicon sequencing (HiPlex) or Shotgun sequencing data.
SMAP haplotype-window extracts an entire DNA sequence between two Borders as a haplotype allele, without prior knowledge of polymorphisms within that sequence, and considers any unique DNA sequence as a haplotype. This is different from SMAP haplotype-sites, which performs read-backed haplotyping using prior positional information of read alignments and creates multi-allelic haplotypes from a concatenated short string of polymorphic sites (ShortHaps).

In the SMAP haplotype-window workflow, the user first selects Windows (loci to be haplotyped) enclosed by pairs of Border regions. Then, for each BAM file and each Window, SMAP haplotype-window extracts the IDs of reads that overlap the respective Window by at least one nucleotide. Using the list of read IDs, a new temporary FASTQ file is created for each sample-Window combination. Then, for each sample-Window FASTQ file, the corresponding Border sequences are used for pattern-match read trimming with Cutadapt. The remaining read sequences per Window are considered haplotypes, which are counted and listed in an integrated haplotype table per sample and per Window.
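The trimming and counting steps described above can be illustrated with a minimal Python sketch. It is not the SMAP implementation: it uses exact string matching instead of Cutadapt, and the function names, Border sequences, and reads are hypothetical values chosen for illustration.

```python
from collections import Counter


def extract_haplotype(read, left_border, right_border):
    """Return the sequence between the two Borders, or None if a Border is missing."""
    start = read.find(left_border)
    if start == -1:
        return None
    start += len(left_border)
    end = read.find(right_border, start)
    if end == -1:
        return None
    return read[start:end]


def count_haplotypes(reads, left_border, right_border):
    """Count every distinct sequence found between the Borders."""
    counts = Counter()
    for read in reads:
        haplotype = extract_haplotype(read, left_border, right_border)
        if haplotype is not None:
            counts[haplotype] += 1
    return counts


# Hypothetical reads for one sample-Window combination.
reads = [
    "AAACCGTACGTTTGG",
    "AAACCGTACGTTTGG",
    "AAACCGTTCGTTTGG",
    "AAACCGTACG",  # right Border missing, so this read is discarded
]
counts = count_haplotypes(reads, left_border="AAACC", right_border="TTTGG")
print(dict(counts))
```

Reads in which either Border cannot be found yield no haplotype in this sketch, which mirrors the idea that only reads spanning the whole Window are counted.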

SMAP haplotype-window considers the entire read sequence spanning the region between the Borders as a haplotype.
SMAP haplotype-window filters out genotype calls of Windows with low read depth, and removes low-frequency haplotypes, to control for noise in the data.
SMAP haplotype-window creates a multi-allelic haplotype table listing haplotype counts and frequencies per Window, per sample, across the sample set.
SMAP haplotype-window plots the haplotype frequency distribution across all Windows per sample, and the distribution of haplotype diversity (number of distinct haplotypes per locus) across the sample set.
SMAP haplotype-window can transform the haplotype frequency table into multi-allelic discrete genotype calls.

Integration in the SMAP workflow

[Figure: SMAP global scheme, haplotype-window]

SMAP haplotype-window requires this input:

a FASTA file with the reference sequence. Typically, whole genome reference sequences are used for Shotgun sequencing data, while a reference consisting of selected candidate genes may be created by SMAP target-selection for HiPlex data.

a GFF file with the coordinates of pairs of Borders that enclose a Window to define the locus positions, created with SMAP sliding-frames for Shotgun data or SMAP design for HiPlex data.

a set of FASTQ files with preprocessed reads that need to be haplotyped. Any number of samples may be given and will be processed in parallel.

a set of BAM files made with BWA-MEM using the respective reference sequence and FASTQ files.

optional: a FASTA file containing the gRNA sequences, created by SMAP design, in case CRISPR was performed by stable transformation with a CRISPR/gRNA delivery vector, see also CRISPR.

Genotype call tables created by SMAP haplotype-window may further be analysed with SMAP grm (for HiPlex and Shotgun data), or with SMAP effect-prediction (for HiPlex data).

Commands & options

The program has four mandatory positional arguments and multiple optional arguments.

Mandatory arguments for SMAP haplotype-window:

genome ################### (str) ### FASTA file with the reference genome sequence. The first positional mandatory argument after smap haplotype-window [no default].
borders ################## (str) ### GFF file with the coordinates of pairs of Borders that enclose a Window. Must contain NAME=<> in column 9 to denote the Window name. The second positional mandatory argument after smap haplotype-window [no default].
alignments_dir ############# (str) ### Path to the directory containing BAM and BAM index (BAI) files. All BAM files should be in the same directory. The third positional mandatory argument after smap haplotype-window. Should be indicated by absolute path or at least a “.” [default current directory].
sample_dir ################ (str) ### Path to the directory containing FASTQ files with the reads mapped to the reference genome to create the BAM files. The FASTQ file names must have the same prefix as the BAM files specified in alignments_dir. The fourth positional mandatory argument after smap haplotype-window [no default].
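Because the borders file must carry NAME=<> in column 9, it can be useful to check a GFF file before starting a run. The following is a minimal sketch of such a check, assuming tab-separated columns and semicolon-separated attributes; the function name and the example record (including its source and feature type) are hypothetical.

```python
def window_name(gff_line):
    """Return the NAME value from column 9 of a GFF line, or None if absent."""
    columns = gff_line.rstrip("\n").split("\t")
    if len(columns) < 9:
        return None
    for attribute in columns[8].split(";"):
        key, _, value = attribute.strip().partition("=")
        if key == "NAME" and value:
            return value
    return None


# Hypothetical Border record; only the NAME attribute matters for this check.
line = "chr1\tSMAP\tborder_up\t100\t110\t.\t+\t.\tNAME=window_1"
print(window_name(line))
```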

optional arguments:

--write_sorted_sequences ############# Write FASTQ files containing the reads for each Window in a separate file per input sample [default off].
-p, --processes ############ (int) ### Number of parallel processes [1].
--memory_efficient ################# Reduces the memory load significantly, but increases time to calculate results.
-o, --out ################ (str) ### Basename of the output file without extension [“”].
-h, --help ###################### Show the full list of options. Disregards all other parameters.
-v, --version #################### Show the version. Disregards all other parameters.
--debug ######################### Enable verbose logging. Provides additional intermediate output files used for sample-specific QC.

Optional arguments may be given in any order.
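As an illustration of the argument order, the sketch below assembles a command line in Python without running it. The four positional arguments come first, in the order genome, borders, alignments_dir, sample_dir, followed by optional arguments. All file and directory names are hypothetical, and the helper function is not part of SMAP.

```python
import shlex


def build_command(genome, borders, alignments_dir, sample_dir, processes=1, out=""):
    """Assemble an argument list: positional arguments first, then options."""
    command = [
        "smap", "haplotype-window",
        genome, borders, alignments_dir, sample_dir,
        "--processes", str(processes),
    ]
    if out:
        command += ["--out", out]
    return command


# Hypothetical input paths.
command = build_command(
    "reference.fasta", "borders.gff", "/data/bam", "/data/fastq",
    processes=4, out="run1",
)
print(shlex.join(command))
```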

--guides ################## (str) ### Optional FASTA file containing the sequences from gRNAs used in CRISPR/Cas genome editing. Useful when amplicons on the CRISPR/gRNA delivery vector are included in the HiPlex amplicon mixture. The recommended name format for the gRNA is Gene_amplicon_gRNA (e.g. AT1G01234_1_gRNA_001).

-q, --min_mapping_quality #### (int) ### Minimum BAM mapping quality to retain reads for analysis [30].
-c, --min_read_count ####### (int) ### Minimum total number of reads per locus per sample [0].
-d, --max_read_count ####### (int) ### Maximum number of reads per locus per sample. Read depth is calculated after filtering out the low-frequency haplotypes (-f) [inf].
-f, --min_haplotype_frequency # (int) ### Set minimum haplotype frequency (in %) to retain the haplotype in the genotyping table. Haplotypes above this threshold in at least one of the samples are retained. Haplotypes that never reach this threshold in any of the samples are removed [0].
-j, --min_distinct_haplotypes # (int) ### Set minimum number of distinct haplotypes per locus across all samples. Loci that do not fit this criterion are removed from the final output [0].
-k, --max_distinct_haplotypes # (int) ### Set maximum number of distinct haplotypes per locus across all samples. Loci that do not fit this criterion are removed from the final output [inf].
--max_error ############# (float) ### The maximum error rate (between 0 and 1, but not exactly 1) for finding the Border sequences in the reads. An error rate of 0 refers to an exact match [0].
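The effect of --min_haplotype_frequency can be sketched as follows: a haplotype is kept if it reaches the threshold in at least one sample, and dropped if it never does. This is a simplified illustration, not the SMAP code. The data layout (one list of per-sample percentages per haplotype) and the inclusive comparison at the threshold are assumptions.

```python
def filter_haplotypes(frequencies, min_haplotype_frequency):
    """Keep haplotypes whose frequency (in %) reaches the threshold in any sample."""
    return {
        haplotype: values
        for haplotype, values in frequencies.items()
        if any(value >= min_haplotype_frequency for value in values)
    }


# Hypothetical frequencies (in %) for one locus across three samples.
frequencies = {
    "GTACG": [60.0, 98.0, 0.0],
    "GTTCG": [39.0, 0.0, 99.0],
    "GTACC": [1.0, 2.0, 1.0],  # never reaches 5% in any sample
}
retained = filter_haplotypes(frequencies, min_haplotype_frequency=5)
print(sorted(retained))
```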


-m, --mask_frequency ####### (float) ## Mask haplotype frequency values below this threshold for individual samples. Can be used to mask noise. Haplotype frequency values below -m are set to -u. Haplotypes are not removed from the genotype table based on this value; use --min_haplotype_frequency for that purpose instead.
-u, --undefined_representation # (str) ### Value to use for non-existing or masked data [NaN].
--cervus ########### Create genotype table in the format that can be used as input for Cervus parental analysis [default off].
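The difference between masking (-m) and removal (-f) can be shown with a short sketch: masking replaces individual values in a sample with the undefined representation (-u), but the haplotype itself stays in the table. The function name and the example values are hypothetical.

```python
import math


def mask_frequencies(values, mask_frequency, undefined=math.nan):
    """Replace per-sample frequencies below the mask threshold with `undefined`."""
    return [undefined if value < mask_frequency else value for value in values]


# Hypothetical frequencies (in %) of three haplotypes in one sample.
masked = mask_frequencies([60.0, 2.0, 38.0], mask_frequency=5)
print(masked)
```

The list keeps its length: the masked entry is still present, only its value is undefined.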

--plot ### (all, summary, nothing) ## Select which plots are generated. Choosing “nothing” disables plot generation. Passing “summary” only generates graphs with information for all samples, while “all” will also generate per-sample plots [default “summary”].

-t, --plot_type ##### (png, pdf) ## Choose the file type for the plots [png].

Discrete calling (--discrete_calls) is primarily supported for diploids and tetraploids. Users can define their own custom frequency bounds for species with a higher ploidy, but this requires optimization based on the observed haplotype frequency distributions.

-e, --discrete_calls ### (str) ### Set to “dominant” to transform haplotype frequency values into presence(1)/absence(0) calls per allele, or “dosage” to indicate the allele copy number.

-i, --frequency_interval_bounds ## Frequency interval bounds for classifying the read frequencies into discrete calls. Custom thresholds can be defined by passing one or more space-separated integers or floats which represent relative frequencies in percentage. For dominant calling, one value should be specified. For dosage calling, an even total number of four or more thresholds should be specified. Defaults are used by passing either “diploid”, “triploid”, or “tetraploid”. The default value for dominant calling (see discrete_calls argument) is 10, regardless of whether “diploid”, “triploid”, or “tetraploid” is used. For dosage calling, the default for diploids is “10 10 90 90”, for triploids “12.5 12.5 50.0 50.0 87.5 87.5”, and for tetraploids “12.5 12.5 37.5 37.5 62.5 62.5 87.5 87.5”.

-z, --dosage_mask ### (int) ### Mask dosage calls in the loci for which the total dosage call for a given locus at a given sample differs from the defined value. For example, in diploid organisms the total dosage call must be 2, in triploid organisms the total dosage call must be 3, and in tetraploids the total dosage call must be 4 [default no masking].
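The dosage mask can be sketched as a check on the sum of dosage calls at one locus in one sample. This is an illustration under assumptions, not the SMAP implementation: the function name, the dictionary layout, and the use of None for masked calls are all hypothetical.

```python
def apply_dosage_mask(dosages, expected_total, undefined=None):
    """Mask all calls at a locus when the total dosage differs from expected_total.

    dosages maps each haplotype to its dosage call for one locus in one sample.
    """
    total = sum(call for call in dosages.values() if call is not None)
    if total == expected_total:
        return dict(dosages)
    return {haplotype: undefined for haplotype in dosages}


# Hypothetical diploid examples (expected total dosage of 2).
consistent = apply_dosage_mask({"hap_a": 1, "hap_b": 1}, expected_total=2)
inconsistent = apply_dosage_mask({"hap_a": 1, "hap_b": 1, "hap_c": 1}, expected_total=2)
print(consistent, inconsistent)
```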

--locus_correctness ######## (int) ### Threshold value: % of samples with locus correctness. Create a new GFF file defining only the loci that were correctly dosage called (-z) in at least the defined percentage of samples [default no filtering].

--frequency_interval_bounds in practical examples and additional information on the dosage mask:

discrete dosage calls for diploids (0/1/2)

Use this option if you want to customize discrete calling thresholds. Haplotypes with a frequency below the lower bound are considered “not detected” and receive dosage 0. Haplotypes with a frequency between the lower bound and the upper bound are considered heterozygous and receive dosage 1. Haplotypes with a frequency above the upper bound are considered homozygous and receive dosage 2. The default intervals are <10, [10:90], and >90. Thresholds are written as space-separated percentages, given as integers or floats [10 10 90 90].

e.g. --discrete_calls dosage --frequency_interval_bounds 10 10 90 90 translates to: haplotype frequency < 10% = 0, haplotype frequency from 10% to 90% = 1, haplotype frequency > 90% = 2.
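The mapping from frequency to dosage can be sketched in Python. This is a minimal illustration, not the SMAP implementation. It assumes that the bounds are read as consecutive pairs (upper limit of one dosage class, lower limit of the next), that intermediate intervals are inclusive as in the default [10:90], and that frequencies falling in a gap between intervals stay undefined (None here). The function name is hypothetical.

```python
def dosage_call(frequency, bounds=(10, 10, 90, 90)):
    """Translate a haplotype frequency (in %) into a discrete dosage call."""
    if len(bounds) < 4 or len(bounds) % 2 != 0:
        raise ValueError("dosage calling needs an even number of four or more bounds")
    top_dosage = len(bounds) // 2
    if frequency < bounds[0]:
        return 0  # below the first bound: not detected
    if frequency > bounds[-1]:
        return top_dosage  # above the last bound: highest copy number
    # Intermediate dosage k uses the interval [bounds[2k - 1], bounds[2k]].
    for k in range(1, top_dosage):
        if bounds[2 * k - 1] <= frequency <= bounds[2 * k]:
            return k
    return None  # frequency falls in a gap between custom intervals


tetraploid = (12.5, 12.5, 37.5, 37.5, 62.5, 62.5, 87.5, 87.5)
print(dosage_call(5), dosage_call(50), dosage_call(95))
print(dosage_call(50, tetraploid))
```

With custom bounds that leave gaps, for example 5 10 90 95, a frequency of 7% matches no interval in this sketch and remains undefined.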