Design a CRISPR/Cas experiment with gRNA and HiPlex panels (potato)

Activate your virtual environment

source .venv/bin/activate

Create the reference sequences

Define the path to the download directory:

download_dir=$/PATH/TO/DOWNLOAD_DIR/

Or, create a new download dir:

cd ../
mkdir download
cd download
download_dir=$(pwd)

Download genomes and gene annotation data:

cd $download_dir
mkdir originals
cd originals
wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/GeneFamilies/genefamily_data.HOMFAM.csv.gz
wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/GeneFamilies/genefamily_data.ORTHOFAM.csv.gz
wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/Genomes/stu.fasta.gz
wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/GFF/stu/annotation.selected_transcript.all_features.stu.gff3.gz
wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/GFF/stu/annotation.selected_transcript.exon_features.stu.gff3.gz
wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/Annotation/annotation.selected_transcript.stu.csv.gz
wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/IdConversion/id_conversion.stu.csv.gz
wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/Descriptions/gene_description.stu.csv.gz
wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/GO/go.stu.csv.gz
wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/InterPro/interpro.stu.csv.gz
wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/MapMan/mapman.stu.csv.gz
wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/Fasta/proteome.selected_transcript.stu.fasta.gz
wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/Fasta/transcripts.selected_transcript.stu.fasta.gz
wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/Fasta/cds.selected_transcript.stu.fasta.gz

Unzip and change some formats for SMAP target-selection and place in SMAP_reference directory:

mkdir ../../SMAP_reference
zcat genefamily_data.HOMFAM.csv.gz | grep -v "#" > ../../SMAP_reference/genefamily_data.HOMFAM.csv
zcat genefamily_data.ORTHOFAM.csv.gz | grep -v "#" > ../../SMAP_reference/genefamily_data.ORTHOFAM.csv
zcat stu.fasta.gz > ../../SMAP_reference/stu.fasta
zcat annotation.selected_transcript.stu.csv.gz > ../../SMAP_reference/annotation.selected_transcript.stu.csv
zcat annotation.selected_transcript.all_features.stu.gff3.gz > ../../SMAP_reference/annotation.selected_transcript.all_features.stu.gff3

These files may be used later:

zcat annotation.selected_transcript.all_features.stu.gff3.gz > ../../SMAP_reference/annotation.selected_transcript.all_features.stu.gff3
zcat annotation.selected_transcript.exon_features.stu.gff3.gz > ../../SMAP_reference/annotation.selected_transcript.exon_features.stu.gff3
zcat gene_description.stu.csv.gz > ../../SMAP_reference/gene_description.stu.csv
zcat go.stu.csv.gz > ../../SMAP_reference/go.stu.csv
zcat interpro.stu.csv.gz > ../../SMAP_reference/interpro.stu.csv
zcat mapman.stu.csv.gz > ../../SMAP_reference/mapman.stu.csv
zcat proteome.selected_transcript.stu.fasta.gz > ../../SMAP_reference/proteome.selected_transcript.stu.fasta
zcat transcripts.selected_transcript.stu.fasta.gz > ../../SMAP_reference/transcripts.selected_transcript.stu.fasta
zcat cds.selected_transcript.stu.fasta.gz > ../../SMAP_reference/cds.selected_transcript.stu.fasta

Create short_list of homology groups in the SMAP_reference directory:

cd ../../SMAP_reference
echo "HOM05D000431" > shortlist_HOM.tsv

Add as many different homology groups as you like:

echo "HOM05D000432" >> shortlist_HOM.tsv

Create fasta reference files per homology group with SMAP target-selection (use ./ notation to indicate the path to this directory, otherwise the script tries to write files to the root of the system, where is has no write access rights).:

python3 ../smap/utilities/SMAP_target-selection.py ./annotation.selected_transcript.all_features.stu.gff3 ./stu.fasta ./genefamily_data.HOMFAM.csv stu --region 1000 --hom_groups ./shortlist_HOM.tsv

Troubleshooting

In some cases the GFF file contains formatting errors in a subset of the genes, that cause problems when processing the GFF file with gffutils. In this case a quick workaround is to try to extract the gff features from the candidate genes only, and use that GFF file as input for SMAP target-selection.

Create a short_list of candidate genes for the list of selected homology groups of the species of interest (the corresponding gene_IDs are listed in the third column of the file genefamily_data.HOMFAM.csv)

grep -f shortlist_HOM.tsv ./genefamily_data.HOMFAM.csv | grep "stu" | cut -f3 > genelist_HOM05D000431_HOM05D000432.txt

Extract the gff features for each of the candidate genes from the original GFF file

grep -f genelist_HOM05D000431_HOM05D000432.txt ./annotation.selected_transcript.all_features.stu.gff3 > annotation.selected_transcript.all_features.HOM05D000431_HOM05D000432.stu.gff3

Run SMAP target selection again, using the short GFF file

python3 ../smap/utilities/SMAP_target-selection.py ./annotation.selected_transcript.all_features.HOM05D000431_HOM05D000432.stu.gff3 ./stu.fasta ./genefamily_data.HOMFAM.csv stu --region 1000 --hom_groups ./shortlist_HOM.tsv

Output

SMAP target-selection will report for how many homology groups it was not possible to extract any sequence of the given species.

————————
2022-10-14 10:17
————————
* Get sequences of genes in selected homology groups in fasta format…
* 0 homology group(s) are not present in ./genefamily_data.HOMFAM.csv for species “stu” and are listed in ./shortlist_HOM_not_present.txt
* Get features of genes in selected homology groups in gff format…
* 0 homology group(s) are not present in ./genefamily_data.HOMFAM.csv for species “stu” and are listed in ./shortlist_HOM_not_present.txt
* Finished
————————
2022-10-14 10:18
————————

Identify gRNAs

Install FlashFry:

mkdir ../SMAP_design
cd ../SMAP_design
mkdir FlashFry
cd FlashFry
wget https://github.com/mckennalab/FlashFry/releases/download/1.15/FlashFry-assembly-1.15.jar

Concatenate all selected genes into one fasta file, and corresponding gff file:

cat ../../SMAP_reference/HOM05D000431_stu_selected_genes_extended_1000_bp.fasta ../../SMAP_reference/HOM05D000432_stu_selected_genes_extended_1000_bp.fasta > stu_selected_genes_extended_1000_bp.fasta
cat ../../SMAP_reference/HOM05D000431_stu_selected_genes_extended_1000_bp.gff ../../SMAP_reference/HOM05D000432_stu_selected_genes_extended_1000_bp.gff > stu_selected_genes_extended_1000_bp.gff

Create off-target database:

mkdir tmp
java -Xmx4g -jar FlashFry-assembly-1.15.jar index -tmpLocation ./tmp -database stu_selected_genes_extended_1000_bp -reference stu_selected_genes_extended_1000_bp.fasta -enzyme spcas9ngg

Discover gRNAs in reference sequences:

java -Xmx4g -jar FlashFry-assembly-1.15.jar discover --database stu_selected_genes_extended_1000_bp --fasta stu_selected_genes_extended_1000_bp.fasta --output stu_selected_genes_extended_1000_bp_guides.fasta.off_targets.tsv

Create scores per gRNA:

java -Xmx4g -jar FlashFry-assembly-1.15.jar score --input stu_selected_genes_extended_1000_bp_guides.fasta.off_targets.tsv --output stu_selected_genes_extended_1000_bp_guides.fasta.off_targets.scores.tsv --scoringMetrics doench2014ontarget,doench2016cfd,hsu2013 --database stu_selected_genes_extended_1000_bp

Output

FlashFry creates lists of gRNAs throughout the entire length of the reference sequences, and can predict specificity and activity scores.

Create pairs of amplicons and gRNAs for all genes

Run SMAP design:

cd ../
mkdir all_genes
python3 ../smap-design/smap_design/SMAPdesign.py FlashFry/stu_selected_genes_extended_1000_bp.fasta FlashFry/stu_selected_genes_extended_1000_bp.gff -g FlashFry/stu_selected_genes_extended_1000_bp_guides.fasta.off_targets.scores.tsv -o stu_all_selected_genes_extended_1000_bp_output -v -gl -na 5 -al -smy -sf -aa -db
mv stu_all_* all_genes/

Output

11:09:57
——————————————————–
Length Amplicons : 120-150 bp
Generate total amplicons : 150
Generate non-overlapping amplicons: 5
Generate gRNAs per amplicon: : 2
——————————————————–
1/11 (PGSC0003DMG400023803)
2/11 (PGSC0003DMG400025895)
3/11 (PGSC0003DMG400023636)
4/11 (PGSC0003DMG400030830)
5/11 (PGSC0003DMG400023441)
6/11 (PGSC0003DMG400026211)
7/11 (PGSC0003DMG400016156)
8/11 (PGSC0003DMG400017763)
9/11 (PGSC0003DMG400010356)
10/11 (PGSC0003DMG400013240)
11/11 (PGSC0003DMG400029315)
Elapsed time: 00:04:51.17
———————————————————

Create pairs of amplicons and gRNAs for a subset of genes

Create short_list of genes:

cd ../
echo "PGSC0003DMG400023803" > shortlist_target_genes.tsv

Add as many different homology groups as you like:

echo "PGSC0003DMG400025895" >> shortlist_target_genes.tsv
echo "PGSC0003DMG400023636" >> shortlist_target_genes.tsv

Run SMAP design:

mkdir selected_genes
python3 ../smap-design/smap_design/SMAPdesign.py FlashFry/stu_selected_genes_extended_1000_bp.fasta FlashFry/stu_selected_genes_extended_1000_bp.gff -g FlashFry/stu_selected_genes_extended_1000_bp_guides.fasta.off_targets.scores.tsv -o stu_selected_genes_extended_1000_bp_output -sg shortlist_target_genes.tsv -v -gl -na 5 -al -smy -sf -aa -db
mv stu_selected_* selected_genes/

Output

SMAP design will report the parameter settings and for how many genes it was not possible to design any amplicons. The output is the same as described above, but limited to the candidate genes.

10:51:26
——————————————————–
Length Amplicons : 120-150 bp |
Generate total amplicons : 150 |
Generate non-overlapping amplicons: 5 |
Generate gRNAs per amplicon: : 2 |
——————————————————–
1/3 (PGSC0003DMG400023803)
2/3 (PGSC0003DMG400025895)
3/3 (PGSC0003DMG400023636)
../smap-design/smap_design/SMAPdesign.py:1596: UserWarning: FixedFormatter should only be used together with FixedLocator ax_ampliconGene.set_xticklabels(GeneLabel,

Elapsed time: 00:01:21.27
——————————————————–

Design a CRISPR/Cas experiment with gRNA and HiPlex panels (microalgae)

Activate your virtual environment

source .venv/bin/activate

Create the reference sequences

Define the path to the download directory:

download_dir=$/PATH/TO/DOWNLOAD_DIR/

Or, create a new download dir:

cd ../
mkdir download
cd download
download_dir=$(pwd)

Download genomes and gene annotation data:

cd $download_dir
mkdir originals
cd originals
wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/GeneFamilies/genefamily_data.HOMFAM.csv.gz
wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/GeneFamilies/genefamily_data.ORTHOFAM.csv.gz
wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/Genomes/stu.fasta.gz
wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/GFF/stu/annotation.selected_transcript.all_features.stu.gff3.gz
wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/GFF/stu/annotation.selected_transcript.exon_features.stu.gff3.gz
wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/Annotation/annotation.selected_transcript.stu.csv.gz
wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/IdConversion/id_conversion.stu.csv.gz
wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/Descriptions/gene_description.stu.csv.gz
wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/GO/go.stu.csv.gz
wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/InterPro/interpro.stu.csv.gz
wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/MapMan/mapman.stu.csv.gz
wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/Fasta/proteome.selected_transcript.stu.fasta.gz
wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/Fasta/transcripts.selected_transcript.stu.fasta.gz
wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/Fasta/cds.selected_transcript.stu.fasta.gz

Genetic diversity and composition analysis for wild and cultivated Coffea canephora

In this tutorial we analyse population structure and composition and establish a core collection of a wild coffee population using molecular markers derived from GBS and HiPlex NGS read data. By the end of the tutorial you are able to employ some commonly used methods (PCA, fastSTRUCTURE, genetic distance matrix) to quantify genetic diversity and detect population structure. Additionally, you are able to use the Maximization strategy and distance method to establish a core collection.

Characterization of the population structure and establishment of a core collection of wild C. canephora using GBS SNP markers

Sample set

To characterize the genetic structure of a wild coffee population, 256 coffee samples were collected over 24 forest inventory plots of 125 x 125m, covering an area of ca. 50-by-20km in the Yangambi region in the Democratic Republic of the Congo. The inventory plots were previously classified by Depecker et al. (2022), into three different forest categories:

  • 8 plots in regrowth forest

  • 7 plots in disturbed old-growth forest

  • 9 plots in undisturbed old-growth forest

Used software

GBprocesS

Before we can map the read data to a reference genome, demultiplexing of the Genotyping-By-Sequencing libraries was done using GBprocesS. GBprocesS uses a config.ini file as the configuration syntax for demultiplexing, trimming, merging forward and reverse reads, base quality filtering, and removal of reads with internal restriction sites. The structure of the config.ini file and more explanation can be found in the GBprocesS manual.

Genomic DNA of the wild coffee samples was digested with restriction enzymes PstI and MseI. Due to the large sample set, library preparation was separated over two batches, whereby the sample-specific barcodes were reused, and each batch was amplified with a different set of i5 and i7 Illumina barcodes and added to Illumina flowcell binding sites during PCR.

The cutsite remnants of the applied restriction enzymes can be found on NEB by looking up the RE recognition site. In this case the cutsite remnant of PstI is CTGCA and the cutsite remnant of MseI is TTNN. The barcoded and common adapters were based on Truseq sequencing primers, of which the sequence for trimming is stored in the GBprocesS code.

Mapping reads with BWA-MEM

After demultiplexing, the merged reads can be directly mapped onto the reference genome sequence. Here, we use the publicly available reference genome of Coffea canephora (Denoeud et al., 2014). Reads of all samples were mapped onto the reference genome using Burrow-Wheeler Aligner Maximal Exact Match (BWA-MEM)

source /path/to/python_virtual_environment/bin/activate

python3 map_BWA_parallel.py -d /path/to/merged_reads_directory/ \
                       -s .fq \
                       /path/to/reference_genome_fasta_file \
                       -o /path/to/output_directory \
                       -p 16 \
                       -l 03_map_BWA_parallel_P1.log

date

SMAP delineate

The distribution of teh mapped GBS reads can be checked by running SMAP delineate on the BAM files of the entire sample set (256 samples).

smap delineate /path/to/BAM \
-mapping_orientation ignore \
--processes 16 \
--plot all \
--plot_type png \
--min_stack_depth 4 \
--max_stack_depth 400 \
--min_cluster_depth 9 \
--max_cluster_depth 400 \
--completeness 80

SMAP delineate gives us insight into read mapping quality, locus length distribution, locus read depth distribution, the number of loci per sample (saturation curve) and locus completeness across the sample set.

../_images/StackCluster.Saturation.scatter6.png

18 644 loci with completeness >80%

SNP calling and filtering

The Unified Genotyper tool of GATK was used to perform the SNP calling by comparing the mapped reads to the reference genome of C. canephora. We used the .bed file provided by SMAP delineate to only call SNPs within the 18644 loci with completeness higher than 80% across all 256 samples.

cd /path/to/bamfiles_folder

#Make a string variable that contains all input .bam files
BAM_FILES=( "*".BWA.bam )
for i in "${BAM_FILES[@]}"
do #make a variable STRING that contains all bam files, including the "-I" option that needs to precede the sample name in the GATK command
STRING+='-I '
STRING+="$i "
done

#Run docker
docker run -w /path/to/bamfiles_folder/ \
                -v /path/in/host/to/link/:/path/in/container/to/link/ broadinstitute/gatk3:3.7-0 java \
                -jar /usr/GenomeAnalysisTK.jar \
                -R /path/to/reference_genome_fasta_file.fasta \
                -T UnifiedGenotyper \
                $STRING \
                -o /path/to/output_directory/output_name.vcf \
                -L /path/to/path_to_bed_file

Output: 239.007 SNPs

Warning: GATK will use the file name of the bam file which is defined within the bam file as sample name. By changing the name of the bam file name manually or via the command line, the sample will not be adapted in the .vcf file. Therefore, if you want to work with shorter or more informative names in further analyses, we recommend to change the samples after SNP calling in the obtained .vcf file using this customized python script and an excel file containing the original .bam file names in one collumn and the preferred names in another column.

Checking for genetically identical individuals

Before analyzing the population structure, we want to make sure that every individual is a genetically unique individual. Normally DNA of every individual was extracted and sequenced once, but possible labeling errors during plant maintenance, sampling, DNA extraction,library preparation, or sequencing, may lead to identical fingerprints across the sample set, or the identification of clonal materials in the collection. Therefore, we check if each indiviudal in the sample set is unique by calculating the pairwise Jaccard Inversed Distance (JID) between all pairs of individuals using SMAP grm.

Haplotype calling

SMAP grm uses haplotype calls of each individual to calculate the pairwise Jaccard Inversed Distance (JID). Therefore, read-backed haplotyping was conducted based on the combined variation in SMAPs (obtained by SMAP delineate) and SNPs (obtained by GATK) using the SMAP haplotype-sites: module.

smap haplotype-sites /path/to/BAM /path/to/bed /path/to/vcf \
-mapping_orientation ignore \
--partial include \
--processes 16 \
--plot_type png \
--no_indels \
--min_distinct_haplotypes 2 \
--min_read_count 10 \
--min_haplotype_frequency 2 \
--discrete_calls dosage \
--frequency_interval_bounds 10 10 90 90 \
--dosage_filter 2

Output: 22602 haplotypes within 18644 loci

Genetic similarity

SMAP grm code:

smap grm --table /path/to/haplotype_table \
--input_directory /path/to/directory/of/haplotype/table \
--output_directory /path/to/output/directory/ \
--samples /path/to/sample_ID_file \
--processes 16 \
--plot_type png \
--locus_completeness 0.1
--similarity_coefficient Jaccard \
--locus_information_criterion Shared \
--partial
--print_sample_information All \
--print_locus_information All \
--plot_format png \

Alternative: you can also calculate the JID for a subset of individuals. To do so, you only list the individuals of the subset in the Sample_ID.txt file specified by the parameter –samples. The order of the samples within the JID matrix will be specified by the order of the individuals listed in the Sample_ID.txt file. You can also give your individuals an alternative name in the JID matrix by adding a second column to the sample ID text file indicating the new names.

First, we look at the distribution of the pairwise JID values to set a threshold to identify all pairs of genetically identical samples.

../_images/JID_distribution.png

Distribution of JID values across all pairwise comparisons revealed a group of samples with pairwise JID in the range 0.974–1, and a separate group of sample pairs with pairwise Jaccard Inversed Distance ranging between 0.64 and 0.89. A JID of 1 means identical haplotype calls at all detected loci (i.e., genetically identical). In practice, a Jaccard Inversed Distance in the range of 0.974 to 1, means that one HiPlex locus out of all XX detected loci in the sample pair may display a different haplotype constitution between two samples. (for instance a single instance of a false negative or false positive detection of a haplotype, due to technical errors). Therefore, the minimal JID of 0.974 (for GBS data) was used as a threshold to identify all pairs of genetically identical samples (i.e., clones).

Second, genetically identical samples were identified using the minimal JID values of 0.974.

../_images/JID_matrix.png

Seven pairs of samples (indicated in red) had a JID higher than 0.974 and were therefore identified as genetically identical samples. As Coffea canephora is a self-incompatible species we did not expect genetically identical samples in our sample set. By looking further at the genetically identical sample pairs, we discovered some repeats that indicated errors had been made during the creation of the GBS libraries in the lab. To prevent bias from these identical pairs in the population structure analyses, one individual of each identical sample pair (7 in total) was removed from the sample set.

Investigating the genetic structure and composition

Principal component analysis

A principal component analyses performed on the filtered 3212 SNPs for the remaining 249 wild C. canephora samples, will give us a first impression of the population structure within the wild coffee population. An annotation file is needed to annotate the samples with more information, such as sample ID, sampling location, location information, etc. The order of the samples in the annotation file needs to be the same as in the .vcf file

```
R script for PCA with SNP data
```

# *Install and load required packages*

library(vcfR)
library(adegenet)
library(ape)
library(tidyverse)

# *Set working directory*

setwd("/path/to/working_directory/")
getwd()

# *Read vcf file*

vcf <- read.vcfR("path/to/.vcf")

##Check vcf file
head(vcf)
vcf@fix[1:10,1:5]

## Convert vcf to Genlight object*
gen <- vcfR2genlight(vcf)

## Add real SNP names
locNames(gen) <- paste(vcf@fix[,1],vcf@fix[,2],sep="_")

# *Perform a PCA*

pca <- glPca(gen, nf=100) #principal component analysis with 100 PCs

plot(pca$scores)

pca_scores <- as.data.frame(pca$scores)

## Check the proportion of variance explained by the first three PC axes
pca$eig[1]/sum(pca$eig)
pca$eig[2]/sum(pca$eig)
pca$eig[3]/sum(pca$eig)

# Create a dataframe with the annotation file and PCA scores

## Import annotation file
annotations <- read.delim("Annotations.txt", sep = "\t", header = TRUE)
### WARNING: The order and names of the individuals must be the same as in the VCF file.

## Create a new dataframe containing a first column wiht Sample_IDs and other columns containing the PCA scores
pca_annotated <- cbind(annotations, pca$scores[,c(1:3)])

# Plot the PCA
pca_plot <- ggplot(pca_annotated,aes(x = PC1,y = PC2, color = Plot_nr, shape = Forest_disturbance))+
geom_point(size=2.5)+
#scale_color_manual(values = c("blue", "red", "green"))+
ggtitle("Wild C. canephora population")+
xlab("PC1 (5.1%)")+
ylab("PC2 (3.2%)")+
theme_bw()
pca_plot

write.table(pca_annotated, "pca_scores.txt", sep="\t", row.names = FALSE)

png("output_file_name.png", width = 900, height = 600)
pca_plot
dev.off()
../_images/pca_plot.png

Three different clusters could be identified using the PCA analyses. All samples collected in plot 3 clustered together on the positive PC1-axis and all samples collected in plot 1 and 2 clustered together on the negative PC2-axis. All other samples (plot 4-25) clustered together on the negative PC1-PC2-axis.

As the samples are clustered together based on their geographical location, rather than forest disturbance, we have an indication that the geographical location could be a main driver of genetic structure in this wild coffee population. To support this hypothesize, we will analyze the genetic structure using the fastSTRUCTURE analysis.

fastSTRUCTURE

To visualize the genetic structure and genetic composition of the wild coffee population, fastSTRUCTURE was run using a PLINK formatted genotype call file.

vcftools --vcf /path/to/SNP_filtered_vcf_file.vcf --plink --out output_file_name

Converting .vcf file to .plink format creates a .map and .ped file.

The .map file contains information about the SNP: Chromosome name, SNP name, Position in Morgans or centiMorgans (optional) and basepair coordinates. However, when our .vcf file was converted to the .map file, the column with the Chromosome names consisted of zeros instead of the chromosome numbers. Without adjusting the first column the fastSTRUCTURE analyses will give an error. Therefore, the chromosome numbers were manually added to the .map file based on the SNP name.

The .ped file contains information about the sample set: Family ID, Individual ID, Father ID, Mother ID, Sex, Phenotype, Rest of the columns: Genotypes for each SNP. However, when our .vcf file was converted to the .ped file, the column which should contain the family ID contained also the individual ID. This is because the .vcf file does not contain any population or sample information besides the sample ID. Therefore, the plot information was manually added to the .ped file based on the annotation file used for the principal component analyses. WARNING: make sure there are no spaces in the family name otherwise plink will read this as too many tokens.

Next, the .ped and .map file are converted to a .bed file .. code:: ruby

plink –file /path/to/vcf_to_plink_output –make-bed –out output_file_name

Using this .bed file, we can run the fastSTRUCTURE analysis. As we currently not know how many clusters are present in our sample set we will loop the analyses for 2 to 9 clusters.

conda activate faststructure
for i in `seq 2 9`; do structure.py --input=/path/to/plink_bed_file(no_suffix) --output=output_file_name -K $i --full --seed=100; done
For each cluster, 5 output files are created:
Note: Each line in the output file corresponds with a sample but the sample name is not given in the output. The order of the samples in the output files is the same as the order in the vcf and plink files.

Each column in the output file corresponds to a cluster.

The chooseK.py script was used to get the optimal number of clusters, which is printed to the terminal. .. code:: ruby

conda activate faststructure

chooseK.py –input path/to/output_name_fastSTRUCTURE(no_suffix)

output: Model complexity that maximizes marginal likelihood = 4

The optimal number of clusters found in the wild coffee population was 4. A new file was manually made were the meanQ results belonging to the K = 4 were complemented with a sample ID, plot number and forest disturbance info so that the fastSTRUCTURE result can be visualized in RStudio.

  • input faststructure file

  • visualizing fastSTRUCTURE using R-script

    ```
    R script to visualize fastSTRUCTURE results
    ```
    
    #Install and load required packages
    
    library(adegenet)
    library(tidyverse)
    
    #Set working directory
    
    setwd("/path/to/working/directory")
    getwd()
    
    #Read manually made faststructure file
    
    fs <- read_delim("R_plot_faststructure.txt", delim = "\t", col_names = TRUE)
    
    #Making the barplot
    
    ggplot(fs, aes(fill=Cluster, y=Value, x=Number)) +
    geom_bar(position="stack", stat="identity", width = 1) +
    theme(axis.text.x=element_blank(),
                    axis.ticks.x=element_blank(),
                    axis.text.y=element_blank(),
                    axis.ticks.y=element_blank(),
                    axis.title.x = element_blank(),
                    axis.title.y = element_blank(),
                    panel.grid.major = element_blank(),
                    panel.grid.minor = element_blank(),
                    panel.background = element_blank())+
    facet_grid(~ Forest,
                            scales = "free_x",
                            space = "free_x",
                            switch = "x")
    
../_images/fastSTRUCTURE_K%3D4.png

The fastSTRUCTURE analysis showed that the plots in undisturbed old-growth forest were divided into two subpopulations, namely plot 16–19 and plot 21–25, with individuals in plots 16–19 showing a mixture of DAPC cluster 1 and 4. Plots in disturbed old-growth forest were divided into four subpopulations, namely: plots 1 and 2; plot 3; plots 5 and 15; plots 11 and 12. Plots in regrowth forest were divided into three subpopulations, namely: plot 4; plot 10; plots 6–9, 13 and 14. Individuals in plot 10 showed to be a mixture of clusters 1 and 4, whereas individuals in plot 4 showed a mixture of all DAPC clusters. Overall, clustering of the samples from old-growth forest was differentiated according to the geographical location of the plots.

Isolation-by-distance

The PCA and fastSTRUCTURE showed genetic structure within the wild coffee population mainly driven by geographical location. To see if this could be a result of isolation-by-distance (IBD), we will calculate genetic distances between the plots, pairwise Fst values (Weir and Cockerham, 1984), using PLINK 1.9 and correlated these genetic distances to the geographical distances between the plots. Therefore, we will use the plink files (also used for fastSTRUCTURE analysis), a list of all plots and a python script:

popfile = "list_lots.txt" #List of populations - one per line
plinkfile = "wild_coffee" #Just the bit before the .bim etc.

import os

#Add pop names to a list
dd = open(popfile)

pops = []
for line in dd:
        pops.append(line.strip())

dd.close()

output = open("pairwise_fst.txt","w")

#Now do a pairwise loop over populations
for i in range(0,len(pops)-1):
        for j in range(i+1,(len(pops))):
                temp = open("temp.txt","w")
                temp.write(pops[i] + ("\n") + pops[j])
                temp.close()
                os.system("plink --bfile " + plinkfile + " --fst --family --keep-fam temp.txt --out temp --autosome-num 37")
                cd = open("temp.log")
                for line in cd:
                        if line.startswith("Mean Fst"):
                                output.write(pops[i] + ("\t") + pops[j] + ("\t") + line.split()[3] + ("\n"))
                                break
                os.system("rm temp.*")

output.close

Output: a text file with the pairwise Fst value (column 3) between the first plot (column 1) and second plot (column 2) To visualize the comparison between plots from different forest disturbance levels (undisturbed, disturbed, and regrowth) an extra column will be manually added to the text file indicating the level of disturbance per plot.

../_images/fastSTRUCTURE_K%3D4.png

Isolation-by-distance was found across all plots explaining the genetic structure found in the wild coffee population.

Establishing a core collection

The wild coffee population in the Yangambi region harbours a high degree of genetic diversity and therefore is a very important genetic resource. To ensure that this high genetic diversity will be preserved, we could opt to create a core collection of this wild population. Genetic data, phenotypical data or agricultural traits can all be taken into account to select the best samples for the core collection, but for this wild coffee population, only the genetic data is currently available.

Core Hunter 3 v3.2.0 R package (De Beukelaer et al., 2018) was used to test two core collection strategies for the 248 samples of the wild coffee population:

  • Maximization strategy (CC-I): maximizes the genetic diversity by selecting the most diverse loci

  • Genetic distance method (CC-X): maximizes the genetic distance between the entries of the core collection and selecting the most diverse plant material within the wild coffee population

For both core collection strategies, allele coverage (CV), diversity within and between alleles [He and Shannon’s index (SH)], and average genetic Modified Roger’s (MR) distance between entry-accession (AN) and entry-to-nearest-entry (EN) were calculated for nine different core sizes (3, 5, 10, 25, 50, 100, 150, 200, and 249 accessions).

The filtered .vcf file will be manually converted to the format needed for CoreHunter. The genotype calls (0/0, 0/1 and 1/1) will be transformed into diploid dosage (0,1,2) and the SNPs rows will be transposed with the sample columns whereby the columns contain data per SNP and the rows contain data per sample. Additionally, if a sample had no genotype call for a specific SNP (./.) this cell will be left empty.

..code-block:: r

` R script to visualize IBD ` # Install and load required packages library(rJava) library(corehunter) library(ggplot2) library(patchwork) library(cowplot) library(tidyverse)

# Import genotype data file genoo <- genotypes(file = “Input_CoreHunter.txt”, format = “biparental”)

# Create CoreHunter data file my.data <- coreHunterData(genoo) my.data

# Test Maximization strategy (CC-I)

## Create empty dataframe to add diversity and distance measures score_CCI <- data.frame(“Size” = numeric(0), “Average entry-to-nearest-entry distance” = numeric(0), “Average accession-to-nearest-entry distance” = numeric(0), “Shannon’s Index” = numeric(0), “Expected heterozygosity”=numeric(0), “Allele coverage”= numeric(0))

for (i in seq(2,(length(genoo$ids)-1),10)) {

## Create core collection with desired size core <- sampleCore(my.data, objective(“AN”, “MR”), size = i, steps = 500)

## Calculate diversity and distance measures within the core collection EN_MR <- evaluateCore(core, my.data, objective(“EN”, “MR”)) AN_MR <- evaluateCore(core, my.data, objective(“AN”, “MR”)) CV <- evaluateCore(core, my.data, objective(“CV”)) SH <- evaluateCore(core, my.data, objective(“SH”)) HE<- evaluateCore(core, my.data, objective(“HE”))

score_CCI[nrow(score_CCI)+1,] <- data.frame(‘Size’ = i, “Average entry-to-nearest-entry distance” = EN_MR, “Average accession-to-nearest-entry distance” = AN_MR, “Expected heterozygosity” = HE, “Shannon’s Index” = SH, “Allele coverage” = CV) }

# Test distance method (CC-X)

## Create empty dataframe to add diversity and distance measures score_CCX <- data.frame(“Size” = numeric(0), “Average entry-to-nearest-entry distance” = numeric(0), “Average accession-to-nearest-entry distance” = numeric(0), “Shannon’s Index” = numeric(0), “Expected heterozygosity”=numeric(0), “Allele coverage”= numeric(0))

for (i in seq(2,(length(genoo$ids)-1),10)) {

## Create core collection with desired size core <- sampleCore(my.data, objective(“EN”, “MR”), size = i, steps = 500)

## Calculate diversity and distance measures within the core collection EN_MR <- evaluateCore(core, my.data, objective(“EN”, “MR”)) AN_MR <- evaluateCore(core, my.data, objective(“AN”, “MR”)) CV <- evaluateCore(core, my.data, objective(“CV”)) SH <- evaluateCore(core, my.data, objective(“SH”)) HE <- evaluateCore(core, my.data, objective(“HE”))

score_CCX[nrow(score_CCX)+1,] <- data.frame(‘Size’ = i, “Average entry-to-nearest-entry distance” = EN_MR, “Average accession-to-nearest-entry distance” = AN_MR, “Expected heterozygosity” = HE, “Shannon’s Index” = SH, “Allele coverage” = CV)}

# Evaluation of the diversity and distance measures

color_list <- c(“#339999”, “#006699”, “#FF9966”, “#CC6666”, “#66CC66” ) measure_list <- c(“Average entry-to-nearest-entry distance”, “Average accession-to-nearest-entry distance”, “Expected heterozygosity”, “Shannon’s Index”, “Allele coverage”) plots <- list() count <- 1

## Create plots for all calculated diversity and distance measures ### WARNING: this codes needs to be run twice: once for CC-I and once for CC-X for (measure in colnames(score_CCX[2:6])) { subtitle <- measure_list[count] plot <- ggplot(score_CCX, aes(x=Size, y= .data[[measure]]))+ geom_point(colour = color_list[count])+ geom_line(colour = color_list[count])+ scale_x_continuous(breaks = seq(min(score_CCX$Size), max(score_CCX$Size), by = 20))+ ggtitle(measure_list[count])+ theme_bw()+ theme(legend.position = “none”, plot.title = element_text(color = color_list[count], size = 12, hjust = 0.5), axis.title.y = element_blank())

plots[[measure]] <- plot count <- count + 1 }

combined_plot <- wrap_plots(plots, ncol=2)+ plot_annotation(title = “Core collection CC-X”, theme = theme(plot.title = element_text(hjust=0.5))) combined_plot

# Create core collection with optimized core size

## Core collection CC-I

CCI <- sampleCore(my.data, objective(“AN”, “MR”), size = 122) CCI_samples <- data.frame(‘Sample_ID’ = CCI$sel) ##save sample ids of the selected samples

## Core collection CC-X CCX <- sampleCore(my.data, objective(“EN”, “MR”), size = 22) CCX_samples <- data.frame(‘Sample_ID’ = CCX$sel) ##save sample ids of the selected samples

# Visualization of core collection samples using PCA

## Here we will use the pca values, which were already calculated in a previous analysis

## Import pca scores

pca <- read_delim(“pca_scores.txt”, delim = “t”, col_names = TRUE)

## Create a column in a new dataframe with the information if the individual is selected as ## an entry for core collection CCI, CCX or both

cc_info <- data.frame(‘Sample_ID’ = as.character(pca$Sample_ID))

cc_info <- Collection %>% mutate(Collection = case_when( cc_info$Sample_ID %in% CCI_samples$Sample_ID & !(cc_info$Sample_ID %in% CCX_samples$Sample_ID) ~”CC-I”, cc_info$Sample_ID %in% CCX_samples$Sample_ID & !(cc_info$Sample_ID %in% CCI_samples$Sample_ID) ~”CC-X”, cc_info$Sample_ID %in% CCI_samples$Sample_ID & cc_info$Sample_ID %in% CCX_samples$Sample_ID ~”CC-I and CC-X”, TRUE ~ “Not in core collections” ))

## Merge the new dataframe with the pca values pca_info <- merge(cc_info, pca, by = “Sample_ID”)

## Create plot

pca_plot <- ggplot(pca_info, aes(PC1, PC2, color = Collection))+ geom_point(alpha = 0.4, size = 3)+ xlab(“PC1 ()”)+ ylab(“PC2 ()”)+ ggtitle(“PCA core collections of wild coffee population”)+ guides()+ theme_bw()

pca_plot

Using the values of the PCA, which we had created earlier, we can show which of the samples were selected for the core collections CC-I and CC-X

../_images/pca_core_collection_plot.png

On the plot we can see that accessions assigned to the CC-I and CC-X core collection were evenly distributed across the ordination space of the wild coffee population.

Characterization of the genetic composition and establishment of a core collection of cultivated C. canephora using HiPlex SNP markers

Purpose

The genetic background of a Coffea canephora cultivar collection grown in the Congo Basin was analysed to determine the different genetic sources of the crop. HiPlex sequencing using 86 loci was performed on 270 C. canephora samples and 9 samples were genotyped using GBS. In addition, HiPlex sequencing with the same loci was performed on 48 samples of local wild C. canephora and 35 samples of a variety of coffee species. At last, whole genome shotgun (WGS) sequencing data of two reference samples of the Congolese subgroup A was retrieved from SRA (Bioproject PRJNA803612). After sequencing, the following workflow was executed to visualize the population structure of coffee in the Congo Basin:

Installation of software

Downloading the specific software versions used in this tutorial will ensure all displayed commands operate properly. The latest software versions can also be downloaded.

  • Python (3.8.13)

  • SMAP package (4.6.5)

    • Using wget and virtual environment:

    wget https://gitlab.com/truttink/smap/-/archive/4.6.5/smap-4.6.5.tar.bz2
    
    tar -xvf smap-4.6.5.tar.bz2
    
    cd smap-4.6.5/
    
    python3 -m venv .venv
    
    source .venv/bin/activate
    
    pip install --upgrade pip
    
    pip install .
    
    #Test if installation was successful:
    
    smap delineate --version
    
    #Output:
    
    (date) root - INFO: This is SMAP 4.6.5
    (date) Delineate - INFO: SMAP-delineate started.
    4.6.5
    
  • QualiMap (2.2.1)

    wget https://bitbucket.org/kokonech/qualimap/downloads/qualimap_v2.2.1.zip
    
    unzip qualimap_v2.2.1.zip
    
    cd qualimap_v2.2.1
    
    #Test if installation was successful:
    ./qualimap multi-bamqc
    
    #Output (first part):
    Java memory size is set to 1200M
    Launching application...
    QualiMap v.2.2.1
    Built on 2016-10-03 18:14
    Selected tool: multi-bamqc
    
    #Possible error when using newer Java versions:
    Java memory size is set to 1200M
    Launching application...
    Unrecognized VM option 'MaxPermSize=1024m'
    Error: Could not create the Java Virtual Machine.
    Error: A fatal exception has occurred. Program will exit.
    
    #Fix by editing qualimap file:
    nano ./qualimap
    
    #Change -XX:MaxPermSize=1024m to new java option: -XX:MaxMetaspaceSize=1024m
    #Test installation again
    
  • Docker (23.0.4)

  • BCFtools (1.13)

    • Install locally:

    wget https://github.com/samtools/bcftools/releases/download/1.13/bcftools-1.13.tar.bz2
    
    tar -xvf bcftools-1.13.tar.bz2
    
    cd bcftools-1.13/
    
    ./configure
    
    #if error => install dependencies depending on your OS (packages stated in terminal)
    #=> ./configure
    
    make
    
    sudo make install
    
    #Test if installation was successful:
    
    bcftools --version
    
    #Output:
    
    bcftools 1.13
    Using htslib 1.13
    Copyright (C) 2021 Genome Research Ltd.
    License Expat: The MIT/Expat license
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.
    
    docker pull staphb/bcftools:1.13
    
    #Test if installation was successful:
    
    docker run staphb/bcftools:1.13 bcftools --version
    
    #Output:
    
    bcftools 1.13
    Using htslib 1.13
    Copyright (C) 2021 Genome Research Ltd.
    License Expat: The MIT/Expat license
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.
    
  • GATK (3.7-0) docker image

    docker pull broadinstitute/gatk3:3.7-0
    
    #Test if installation was successful:
    
    docker run broadinstitute/gatk3:3.7-0 java -jar /usr/GenomeAnalysisTK.jar --version
    
    #Output:
    
    3.7-0-gcfedb67
    
  • VCFtools (0.1.16)

    • Install locally:

    wget https://github.com/vcftools/vcftools/releases/download/v0.1.16/vcftools-0.1.16.tar.gz
    
    tar -xvzf vcftools-0.1.16.tar.gz
    
    cd vcftools-0.1.16/
    
    ./configure
    
    #if error => install dependencies depending on your OS (packages stated in terminal)
    #=> ./configure
    
    make
    
    sudo make install
    
    #Test if installation was successful:
    
    vcftools --version
    
    #Output:
    
    VCFtools (0.1.16)
    
    docker pull biocontainers/vcftools:v0.1.16-1-deb_cv1
    
    #Test if installation was successful:
    
    docker run biocontainers/vcftools:v0.1.16-1-deb_cv1 vcftools --version
    
    #Output:
    
    VCFtools (0.1.16)
    
  • PLINK (version 1.90b6.18)

    • Install locally:

    mkdir PLINK1.19
    
    cd PLINK1.19
    
    wget http://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20200616.zip
    
    unzip plink_linux_x86_64_20200616.zip
    
    #Move to a directory in PATH environment variable
    #for command line tool execution (e.g. /usr/bin)
    sudo mv plink /usr/bin/
    
    #Test if installation was successful:
    
    plink --version
    
    #Output
    
    PLINK v1.90b6.18 64-bit (16 Jun 2020)
    
    docker pull elixircloud/plink:1.9-cwl
    
    #Test if installation was successful:
    
    docker run elixircloud/plink:1.9-cwl plink --version
    
    #Output:
    
    PLINK v1.90p 64-bit (16 Jun 2020)
    
  • fastSTRUCTURE 1.0

    conda create --name faststructure python=2.7
    
    conda activate faststructure
    
    conda install -c bioconda faststructure
    
    #Test if installation was successful:
    
    chooseK.py
    
    #Output:
    
    Here is how you can use this script
    Usage: python /home/guest/miniconda3/envs/fastStructure/bin/chooseK.py
            --input=<file>
    
  • RStudio

Read mapping quality control

The method for read mapping quality control (QC) depends on the NGS method. Mapping quality can be evaluated with QualiMap for WGS, GBS and HiPlex reads. SMAP provides a more specific means of QC for mapped GBS reads: SMAP delineate.

SMAP delineate gives us insight into locus completeness and sequencing read depth across all samples. This is important for our workflow because no variants can be identified in samples where no reads are mapped in polymorphic regions.

smap delineate *path_to_GBS_bam_files* -mapping_orientation ignore -p 8 --plot all --plot_type png -f 50 -g 300 --min_stack_depth 2 --max_stack_depth 500 --min_cluster_depth 10 --max_cluster_depth 1500 --max_stack_number 2 --min_stack_depth_fraction 10 --completeness 1 --max_smap_number 10

Parameters were set for double-enzyme GBS with merged reads in diploid individuals as described in SMAP delineate examples.

The quality of the mapped reads is depicted by the MergedCluster barplots and the saturation curve scatterplot.

Median read depth per Mergedcluster

../_images/final_stack_positions.MergedCluster.MedianRD.histogram6.png

The median read depth per MergedCluster barplot depicts how many MergedClusters are shared between at least half of the samples at a given depth. The completeness of the mapped reads can be determined by comparing the individual StackCluster read depth barplots to this one.

All of our GBS samples have very similar read mapping distributions.

Mergedcluster completeness across samples

../_images/final_stack_positions.MergedCluster.Completeness.histogram6.png

This barplot shows the distribution of MergedClusters completeness across all samples. A MergedCluster is plotted at x-value 1 if that MergedCluster only has a sufficient read depth in one sample. In contrast, MergedClusters with sufficient read depth across all samples are plotted at the maximal x-value (number of samples in the sample set).

Around 14000 MergedClusters are shared across all our samples because they have sufficient read depth in all samples (x = 9). Still, a lot of diversity if present judging by the high number of MergedClusters with no sufficient read depth for some samples.

SMAPs per MergedCluster

../_images/final_stack_positions.MergedCluster.SMAP.histogram6.png

The abundance of read mapping polymorphisms is depicted in the SMAPs per MergedCluster barplot. All MergedClusters where reads have the same read mapping start and end positions contain 2 SMAPs. InDels and SNPs cause an increased number of SMAPs per MergedCluster.

Around 7000 MergedClusters display read mapping polymorphisms in the samples with GBS data.

Calling SNPs

Next, BCFtools and GATK were compared for SNP calling. Polymorphisms were identified, SNPs were selected and filtered on position and quality.

SNP calling

The Unified Genotyper tool of GATK was used to perform the SNP calling by comparing the mapped reads to the reference genome of C. canephora. We provided a bed file containing the HiPlex loci regions because we are only interested in the SNPs inside them. This region-specific SNP calling greatly reduces computation time. The reference genome bed file has to be included if you need to call SNPs in the entire genome. A lot more resources are required for complete SNP calling but it can be worthwhile if the vcf file will be used in more pipelines.

docker run -w *working_directory_in_container* -v *path_in_host_to_link*:*path_in_container_to_link* broadinstitute/gatk3:3.7-0 java -jar /usr/GenomeAnalysisTK.jar \
        -R *path_to_reference_genome_fasta_file* \
        -T UnifiedGenotyper -I *file_with_bam_files* \
        -o *path_to_vcf_file* \
        -L *path_to_bed_file*

The command also needs a .list file that contains all bam file names (\n separated).

The following command was used to determine the number of SNPs:

grep -v ^# *path_to_vcf_file* | wc -l
output: 12130453

SNP filtering

Amplicon region filter

After complete calling it is still possible to filter for SNPs inside the specified amplicon regions with following command:

vcftools --vcf *path_to_vcf_file* --bed *path_to_bed_file* --recode --recode-INFO-all --out *path_to_output_1*

output:
After filtering, kept 364 out of 364 Individuals
Outputting VCF file...
After filtering, kept 1230 out of a possible 12130453 Sites
Run Time = 201.00 seconds

VCFtools was used to only retain SNPs inside the amplicon regions.

It is possible the following warning is printed in the terminal:

Example warning:
Warning: Expected at least 2 parts in FORMAT entry: ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">

VCFtools does not expect commas in the description tag of the header lines. This warning did not have an influence on our input but if so, the warning can be prevented by removing/changing all commas in the description.

SNP/sample quality filter

vcftools --vcf *path_to_output_1* --min-meanDP 30 --mac 4 --minQ 20 --minGQ 30.0 --minDP 10.0 --recode --out *path_to_output_2*

Output:
After filtering, kept 364 out of 364 Individuals
Outputting VCF file...
After filtering, kept 415 out of a possible 1230 Sites
Run Time = 0.00 seconds

This command filtered on:

  • Minimum average depth per sample of 30

  • Minimum allele count of 4

  • Minimum SNP quality of 20

  • Minimum genotype quality of 30

  • Minimum depth per sample of 10

Polymorphic SNP filtering

docker run -v *path_in_host_to_link*:*path_in_container_to_link* broadinstitute/gatk3:3.7-0 java -jar /usr/GenomeAnalysisTK.jar \
        -T SelectVariants \
        -R *path_to_reference_genome_fasta_file* \
        -V *path_to_output_2* \
        --excludeNonVariants \
        -o *path_to_output_3*

The SelectVariants tool of GATK was used to retain polymorphic SNPs. The command does not output information about the filtering in the terminal.

Check SNP number:

grep -v ^# *path_to_output_3* | wc -l

output: 415

Previous filtering apparently already removed non-polymorphic sites.

Biallelic SNP filtering

docker run -v *path_in_host_to_link*:*path_in_container_to_link* broadinstitute/gatk3:3.7-0 java -jar /usr/GenomeAnalysisTK.jar \
        -T SelectVariants \
        -R *path_to_reference_genome_fasta_file* \
        -V *path_to_output_3* \
        -restrictAllelesTo BIALLELIC \
        -o *path_to_output_4*

Example terminal output (for subset of 10 samples).

Next, the SelectVariants tool of GATK was used to only retain biallelic SNPs.

Check SNP number:

grep -v ^# *path_to_output_4* | wc -l

output: 402

The result of the SNP calling and filtering with GATK/VCFtools was a total of 402 SNPs. This is be compared with BCFtools in the next tab.

PCA

A principal component analysis (PCA) was performed in RStudio on the filtered SNPs to quantify the genetic distances between samples of different locations and species. SNPs from both GATK and BCFtools were used and their results were compared. An annotation file is needed to annotate the samples with more information, such as sampling location, species information, sample ID, etc.

This renv lock file in combination with the renv package can be used to recreate our exact RStudio environment. Put the renv lock file in a directory where you have permissions and where no other lock file is present. Skip the renv code block in the script below if you want to use the latest package versions.

The results from both GATK and BCFtools are shown below. The scatterplots were put on the same scale for comparison.

../_images/GATK.png

C. canephora samples are separated from the rest of the coffee species on the PC1-axis. The C. canephora individuals are separated into groups of wild and cultivated origin on the PC2-axis. A couple of individuals, thought to be C. canephora, are placed among the other species. Conversely, an individual labeled as Coffea congensis was grouped together with C. canephora.

fastSTRUCTURE

Next, the genetic composition of the samples was determined using fastSTRUCTURE, which requires a PLINK formatted genotype call file as input.

vcftools --vcf *path_to_filtered_vcf_file* --plink --out *path_to_output1*

First, the filtered vcf file was converted to the PLINK format and .map and .ped files were created. Do not add any suffix to the output file name, this will be added automatically.

We got some warnings after running the command:

example warning:
Unrecognized values used for CHROM: Cc_chr1 - Replacing with 0.

The resulting .map file should contain 4 columns (without headers): Chromosome name, SNP name, Position in morgans or centimorgans (optional), Base-pair coordinates.

Example .map

0

Cc_chr1:4736327

0

4736327

0

Cc_chr1:4736344

0

4736344

0

Cc_chr1:4736368

0

4736368

0

Cc_chr1:4736369

0

4736369

0

Cc_chr1:4736373

0

4736373

However, the first column of our .map file only contained zeros because VCFtools does not recognize the chromosome notation in the vcf file. Therefore, the chromosome number was added to the .map file with a python script.

python correct_map_ped.py --map *path_to_map_file* --mapout *path_to_output_map_file*

output:
Used parameters for the script:
Input map file =         Coffea.map
Output map file =        Coffea_edited.map
Input ped file =         None
Output ped file =        None
Annotation file =        None
Sample ID column =       Sample_ID
Population ID column =   Population_ID

Format of the converted .map file:

Edited .map

01

4736327

0

4736327

01

4736344

0

4736344

01

4736368

0

4736368

01

4736369

0

4736369

The .ped file has the following columns: Family ID, Individual ID, Father ID, Mother ID, Sex, Phenotype, Rest of the columns: Genotypes for each SNP.

Example .ped

Wild_plot_1843

Wild_plot_1843

0

0

0

0

A

A

Wild_plot_1844

Wild_plot_1844

0

0

0

0

G

G

Wild_plot_1845

Wild_plot_1845

0

0

0

0

G

G

However, in our case the first 6 columns of the .ped file only had one distinct value: the family ID and Individual ID are the same, and the other 4 columns contain zeros. The same python script was used to add a family ID to the .ped file which is required for fastSTRUCTURE.

An annotation file is needed for the conversion: example

python correct_map_ped.py --ped *path_to_ped_file* --pedout *path_to_output_ped_file* --annot *path_to_annotation_file* --pop_ID *column_name_of_population_ID_in_annotation_file* --sample_ID *column_name_of_sample_ID_in_annotation_file*

output:
Used parameters for the script:
Input map file =         None
Output map file =        None
Input ped file =         Coffea.ped
Output ped file =        Coffea_edited.ped
Annotation file =        Annotations.txt
Sample ID column =       Sample_ID
Population ID column =   Collection

Format of the converted .ped file:

Edited .ped

Wild_Collection

Wild_plot_1843

0

0

0

0

A

A

Wild_Collection

Wild_plot_1844

0

0

0

0

G

G

Wild_Collection

Wild_plot_1845

0

0

0

0

G

G

Note that the python script can perform both conversions simultaneously.

plink --file *path_to_output1(no_suffix)* --make-bed --out *path_to_output2(no_suffix)*

Next, the converted .ped and .map files were used to create .bed and .map files using PLINK. Again, do not add any suffix to both the input and output file.

Example output files:

It is normal that the outputted .bed file is in a compressed format.

Finally, fastSTRUCTURE was run:

conda activate faststructure

For i in `seq 2 9`; do structure.py --input= *path_to_output2(no_suffix)* --output= *path_to_output3(no_suffix)* -K $i --full --seed=100; done

The tool requires the user to specify the number of clusters (-K option). fastSTRUCTURE was executed with multiple -K values because we did not know the optimal cluster number for our data. The output was examined afterwards to determine the best -K value.

The tool should output 5 files per -K. Examples for K = 4:

  • .log

  • .meanQ: mean ancestry proportion per sample

  • .meanP: mean error rate of proportion per sample

  • .varQ: variance of mean ancestry proportion per sample

  • .varP: variance of mean error rate of proportion per sample

conda activate faststructure

chooseK.py --input *path_to_output3(no_suffix)*

Output:
Model complexity that maximizes marginal likelihood = 4
Model components used to explain structure in data = 4

The chooseK.py script was used to get the optimal number of clusters which is printed in the terminal.

The fastSTRUCTURE results belonging to the optimal K-value were visualized with a R script.

../_images/faststructure_plot.png

The barplot depicts the population structure of each sample location where the colors define the different genetic resources: green (Congolese subgroup A), red (Lula cultivars), blue (wild) and purple (other Coffea spec.). All four genetic resources were present in the collection where most of the samples were assigned to the Lula cultivars (red). Some sample from the coffee collection showed an admixed composition of Lula-Wild or Lula-Congolese subgroup A.

SMAP haplotype-sites

Multi-allelic haplotypes in the amplicon regions of the coffee samples were determined with SMAP haplotype-sites based on the filtered SNPs. The output is a table containing discrete haplotype calls for all the samples.

smap haplotype-sites *path_to_data* *path_to_bed_file* *path_to_filtered_vcf_file* -mapping_orientation ignore -partial exclude --no_indels --min_read_count 10 --min_haplotype_frequency 2 -p 8 -j 2 -t png -o *path_to_output* -e dosage -i 10 10 90 90 -z 2

The program requires three input arguments:

  • Location of the bam files to be analysed

  • Location of the bed file containing the amplicon regions

  • Location of the filtered vcf file

Mapping orientation was set to ignore because our reads are merged before mapping them to the reference genome. Use stranded orientation for GBS single-end or paired-end reads that are mapped separately. Only reads that completely cover the locus, including both start and end points, were considered for haplotyping with -partial exclude.

-j and -z were set to 2 because we expect an allele count of 2 per locus per individual for a diploid organism. -e was set to dosage to work with allele copy numbers instead of allele frequencies. The -i option, specifying the frequency interval bounds, was set to the default for diploid species (See options for discrete calling in individual samples).

SMAP grm

The pairwise genetic similarities between all the samples was calculated and plotted using SMAP grm.

We used a python script adapted from SMAP grm because the command line tool has some issues with plotting the similarity heatmap. The python script was first executed to get a genetic distance matrix file which was then used to define the sample plotting order by clustering similar samples together.

python SMAP_grm.py --table *path_to_filtered_discrete_tsv_file* -p 32 --print_sample_information Matrix \
--similarity_coefficient Jaccard -lc 0.1 --distance

--distance ensured a distance matrix is outputted which was needed to create a phylogenetic tree later on. The lc option was set to 0.1 to filter loci present in less than 10% of the samples. Jaccard was chosen as the similarity coefficient because it is appropriate for discrete dosage calls.

Jaccard similarity coefficient: a / (a + b + c)

A text file can be provided to SMAP grm containing 1 mandatory and 2 optional columns. The first column contains the sample plotting order (mandatory). The second column is used to convert the sample names to new ones (optional). The third column contains group IDs to color the sample labels by (optional). Note that an empty second column is still needed if you only want to color the labels by group and not change the sample name!

The distance matrix file was loaded in RStudio to determine the best sample plotting order for the heatmaps. We created two plotting orders in RStudio: one where only the pairwise distances are used to cluster samples together (Global clustering), and one where samples are first grouped by location and then clustered by pairwise distances to determine the structure of each sample location (Location clustering). The resulting sample orders were exported to a text file, with one sample per line.

The Rscript can be downloaded here.

Different clustering techniques are used in the script to order the samples:

  • Single Linkage:

    • uses the minimum distance between any two points in different clusters to merge them

    • fast and can perform well on non-globular data

    • performs poorly in the presence of noise

  • Complete Linkage:

    • uses the maximum distance between any two points in different clusters to merge them

    • performs well on cleanly separated globular clusters

  • Median Linkage:

    • uses the median distance between all pairs of points in different clusters to merge them

    • compromises between single and complete linkage

  • Average Linkage:

    • computes the average distance between all pairs of points in different clusters to merge them

    • performs well on cleanly separated globular clusters

  • Centroid Linkage:

    • merges clusters by computing the distance between the centroids of different clusters

    • useful when dealing with spherical-shaped clusters

  • Ward’s Method:

    • minimizes the total within-cluster variance when merging clusters

    • well-suited for datasets with a moderate to large number of observations

    • most effective method for noisy data

  • McQuitty’s Method

    • uses the average distances between centroids of different clusters to merge them

    • suitable when dealing with large datasets

We chose Ward’s method (ward.D2) because we don’t know the amount of noise in the data and it is widely used.

python SMAP_grm.py --table *path_to_filtered_discrete_tsv_file* -n *path_to_sample_order_file* \
-p 32 --similarity_coefficient Jaccard -lc 0.1 \
--print_sample_information Plot --plot_format png --title_fontsize 100 \
--label_fontsize 18 --tick_fontsize 6 --legend_fontsize 100 --colour_map Spectral

The heatmap plots the pairwise similarity scores between all samples. These similarity values can be added in each cell of the heatmap with the option --annotate_matrix_plots, but this significantly increases the file size. The different font sizes might have to be increased or decreased depending on the size of your dataset. The color map can of course also be changed depending on your taste. Optionally, the dpi can be lowered for very large datasets with -r.

The heatmaps sorted by location clustering and global clustering are shown in the tabs below.

../_images/Jaccard_genetic_similarity_heatmap_grouped.png

(To better see the sample_IDs, download the full image)

The population order is retained in this heatmap to determine the structure of each sample location. The sample labels have been colored according to their location.

The Yangambi collection has 3 subgroups with high in-between similarity. Additionally, one sample shows severe dissimilarity across all samples except for two Psilanthus manni samples. The reference samples of the Congolese subgroup A only shows high similarity with each other. The Wild samples are by far the most related to each other than to other samples the collection or Congolese subgroup A. Finally, there are samples from the Yangambi collection that are more similar to some Coffea spec. samples indicating those collection samples are not C. canephora.

Location clustering also serves as a confirmation of the fastSTRUCTURE results. For instance, there was a large presence of the blue cluster in the Yangambi collection, belonging to the Lula-Wild samples. In the heatmap, we can also see a subgroup in the Yangambi collection with larger similarities to the Wild samples than to the rest of its own group. On the other hand, it is harder to determine the genetic identity of the samples. This will be easier to spot in the heatmap with global clustering.

Phylogenetic tree

Next, a phylogenetic tree was created in RStudio to visually represent both the genetic distances between our samples and the presence of different genetic resources in each sample location. The Neighbour Joining algorithm was used because it can be directly applied to the distance matrix created with SMAP grm. Moreover, branch lengths separating the samples are determined (unlike, for instance, UPGMA) and the tree can be easily rooted using an appropriate outgroup.

An annotation file was used to color the tree branches by the sample locations. Our package versions can again be obtained with a renv lock file.

Core collection

Establishing a core collection is a great way to ensure future sustainable and effective conservation management and use of any collection.

Two strategies to propose core collections were tested using the corehunter RStudio package:

  • Maximization strategy (CC-I)

    • maximizes the genetic diversity by selecting the most diverse loci

  • Genetic distance method (CC-X)

    • maximizes the genetic distance between the entries of the core collection

We were interested in creating core collections for the C. canephora samples of the Yangambi collection. Therefore, we deleted all non Yangambi samples and all non C. canephora samples with VCFtools, including the likely wrongly labelled C. canephora sample from the Yangambi collection. A file containing samples to be retained has to be provided to the command.

vcftools --vcf *path_to_vcf_file* --keep *file_with_desired_samples* --recode --out *path_to_output*

Next, the filtered vcf file from BCFtools was converted to the proper format for corehunter. In short, the GT column was simplified and all SNPs and samples were assigned an ID. The IDs before and after conversion were added to a cross reference file. Additionally, any SNPs that had no GT calls (0/0 or ./.) for any samples were removed and put in a text file for control.

Conversion from vcf to core hunter file was performed with a custom python script.

python convert_vcf_to_core.py --input *path_to_(subsetted)_vcf_file* --core *path_to_core_file* --cross_ref *path_to_cross_ref_file*

Again, a renv lock file can be used to recreate our exact RStudio environment. Be sure to put the lock file in a directory where you have permissions and where no other renv lock file is present.

../_images/Core_PCA_no_annot.png

Using the sampleCore function of Core Hunter we selected the most suitable samples for each core collection strategy and visualized them in the PCA scatterplot. The optimal core size for both core collection strategies is also stated in the legend. The samples assigned to both CC-I and CC-X core collection were evenly distributed across the ordination space of the Yangambi coffee collection. Only 3 samples are present in both CC-I and CC-X collections.

Population genetics of bullfrog, an invasive species

This study involves several hunderd diploid individuals, sampled from 38 locations. Each individual was genotyped using double-digest GBS (PstI-EcoRI). These populations were sampled in Belgium, where bullfrog is an invasive species. The purpose of the study is to analyse population structure and relate this to geographical spread of the species.

GBprocesS

Demultiplexing was done using GBprocesS. See instruction for download and installation of GBprocesS. This is the config (.ini) file used for demultiplexing, trimming, merging forward and reverse reads, base quality filtering, and finaly removal of reads with internal restriction sites. Note that library preparation for such very large sample sets is usually done in several batches, whereby each batch reuses the sample-specific barcodes in the barcode adapter, and each batch is amplified with different sets of i5 and i7 Illumina barcodes during the PCR that adds the flowcell binding sites. In case some of the batch does not use all barcodes, it is important to include a “dummy” barcode in the demultiplexing (with non-existing sequence and the maximum length barcode used across the sample set). This is to make sure that all reads of each sample are trimmed to the same maximal length in the demultiplexing step. This is an example for a barcode file for demultiplexing of incomplete plates. For further explanation see the manual of GBprocesS.

Mapping reads with BWA-MEM

After demultiplexing and merging reads, they can be directly mapped onto a reference genome sequence. Here, we use the publicly available reference genome of bullfrog.

Prepare the reference genome and index it:

BWA

Map the reads of all samples onto the reference genome:

BWA mem

SMAP delineate

A first quality control of any GBS experiment is to run SMAP delineate on the BAM files of the entire sample set.

SMAP delineate:

smap delineate $download_dir/Real_data/SMAP_delineate/input/ind/ -mapping_orientation ignore --processes 8 --plot all --plot_type pdf --min_stack_depth 2 --max_stack_depth 1500 --min_cluster_length 50 --max_cluster_length 300 --max_stack_number 20 --min_stack_depth_fraction 10 --min_cluster_depth 10 --max_cluster_depth 1500 --max_smap_number 20 --name 48_ind_GBS-PE

SMAP delineate will provide an overview of the read mapping quality, locus length distribution, locus read depth distribution, the number of loci per sample (saturation curve) and locus completeness across the sample set.

../_images/Bullfrog_saturation_curve_557_samples_RD15_C5.png

In this case (as shown by the saturation curve), the normalisation of the sample set was not optimal. Almost none of the samples reached locus saturation, with more than 100.000 loci detected for the samples with the largest library size (5-6M reads mapped). This can be explained by the large genome size of the species (about 3 Gb). Most of the samples are positioned on the steep part of the saturation curve (2-3M reads, 40-60k loci), meaning they have insuffcient reads to cover all possible loci with minimal read depth (here RD 15), and sequencing deeper would have yielded more loci per sample. This relatively high level of missing data (due to technical reasons) affects the possibility to compare samples by “common loci”. In addition, some of the samples have nearly no reads (100-500k reads) and a very minimal number of detected loci and are ‘dropout’ samples. The question, then, is how to set a threshold for minimal library size to retain good quality samples?

Effect of library size on genetic relationship estimations

To estimate the number of reads required per sample to estimate stable genomic fingerprints and to quantitatively estimate population genetics parameters, we first investigate how library size affects estimation of genetic relatedness. For this, we chose 14 samples with very high library size (4-6M merged reads after demultiplexing) representing different sampling locations.

we quickly checked (with SMAP haplotype-window and SMAP grm) that these samples indeed represent the genetic diversity (range 0.7-1.0) across the sample set (sneak preview of the rest of the analysis).

../_images/14_comp_subsampling_GRM.png

We then created ‘saturation’ curves that show how many reads are required per sample to yield a given number of loci. Furthermore, we analysed how a reduced number of loci affects pairwise genetic relationship estimation (using SMAP delineate, SNP calling with SAMtools, SMAP haplotype-sites and finally SMAP grm). The purpose of this step is to estimate the minimum number of reads required per sample, and conversely, to select samples with at least that number of reads for the rest of the analysis. Other samples with lower number of reads are excluded, as they have too little data to yield good quality genetic fingerprints.

We created a series of simulated incomplete samples by computationally subsampling the .fastq files:

head -n 400000 01_21_01.fq > 01_21_01_0.10.fq
head -n 800000 01_21_01.fq > 01_21_01_0.20.fq
head -n 1200000 01_21_01.fq > 01_21_01_0.30.fq
head -n 1600000 01_21_01.fq > 01_21_01_0.40.fq
head -n 2000000 01_21_01.fq > 01_21_01_0.50.fq
head -n 3000000 01_21_01.fq > 01_21_01_0.75.fq
head -n 4000000 01_21_01.fq > 01_21_01_1.00.fq
head -n 5000000 01_21_01.fq > 01_21_01_1.25.fq
head -n 6000000 01_21_01.fq > 01_21_01_1.50.fq
head -n 7000000 01_21_01.fq > 01_21_01_1.75.fq
head -n 8000000 01_21_01.fq > 01_21_01_2.00.fq
head -n 9000000 01_21_01.fq > 01_21_01_2.25.fq
head -n 10000000 01_21_01.fq > 01_21_01_2.50.fq
head -n 11000000 01_21_01.fq > 01_21_01_2.75.fq
head -n 12000000 01_21_01.fq > 01_21_01_3.00.fq

Here, the name of the files is coded as follows (split by '_'): first code indicates the location, the second code the individual, the third the replicate. numbers at the end of the file indicate the number of reads included after subsampling (in M) to simulate incomplete data.
This code is repeated for each of 14 samples (.fastq files).

We then ran SMAP delineate again to create the corresponding saturation curve:

smap delineate $download_dir/Real_data/SMAP_delineate/input/ind/ -mapping_orientation ignore --processes 8 --plot all --plot_type pdf --min_stack_depth 2 --max_stack_depth 1500 --min_cluster_length 50 --max_cluster_length 300 --max_stack_number 20 --min_stack_depth_fraction 10 --min_cluster_depth 10 --max_cluster_depth 1500 --max_smap_number 20 --name 48_ind_GBS-PE
../_images/Bullfrog_saturation_curve_276_samples_RD10_C5.png

This curve indeed shows that the shape of the locus saturation curve observed for the entire sample set can be expained by suboptimal library size of a substantial number of samples.

Next, we ran SAMtools to identify SNPs in the 14 samples with complete data:

samtools

We used the .bed file with locus coordinates created by SMAP delineate to select 5000 loci with 2 SMAPs and completeness across all 14 samples. Running SAMtools on this locus set ensures that there is data in all 14 samples, and that in principle all pairwise comparisons to the “ground truth” can be made (as the ground truthh is based on “complete data”). Samtools identified 4000 SNPs in about 2500 loci.

Then, we ran SMAP haplotype-sites to create haplotype tables, and ran SMAP grm to estimate genetic relatedness (Jaccard similarity of loci with partially shared alleles):

smap haplotype-sites
smap grm
../_images/GRM_similarity_saturation_curves_01_21.png

This shows that the estimated genetic relatedness is stable from 1 - 1.25M or more merged reads. With less read data, the estimation of genetic fingerprints starts to deviate from the ground truth (compared to its own sample with saturated data, value expected to be 1.00).

Selection of good quality samples

Now that we have estimated that using samples with at least 1M merged reads is required for good quality population genetic analysis, we will only continue with those 455 samples. These are distributed across the 38 locations as follows.

../_images/nr_ind_per_location_1M.png

We next run SMAP delineate again to estimate the completeness and saturation of this sample set.

SMAP delineate:

smap delineate $download_dir/Real_data/SMAP_delineate/input/ind/ -mapping_orientation ignore --processes 8 --plot all --plot_type pdf --min_stack_depth 2 --max_stack_depth 1500 --min_cluster_length 50 --max_cluster_length 300 --max_stack_number 20 --min_stack_depth_fraction 10 --min_cluster_depth 10 --max_cluster_depth 1500 --max_smap_number 20 --name 48_ind_GBS-PE
../_images/Bullfrog_saturation_curve_455_samples_RD10_C5.png

Identify SNPs

Instructions to identify and filter SNPs

SMAP haplotype-sites

Create and navigate to a new output directory to run SMAP haplotype-sites on the BAM files with mapped GBS reads of a set of individuals, its BED file from SMAP delineate, and a VCF file with SNP calls (see for third-party SNP calling software: e.g. SAMtools, BEDtools, Freebayes, or GATK for individuals):

mkdir -p $download_dir/Real_data/SMAP_haplotype_sites/output/ind
cd $download_dir/Real_data/SMAP_haplotype_sites/output/ind
smap haplotype-sites $download_dir/Real_data/SMAP_delineate/input/ind/ $download_dir/Real_data/SMAP_delineate/output/ind/final_stack_positions_48_ind_GBS-PE_C0_SMAP20_CL50_300.bed $download_dir/Real_data/SMAP_haplotype_sites/input/48_ind_GBS-PE.vcf --out haplotypes_48_ind_GBS-PE -mapping_orientation ignore --discrete_calls dosage --frequency_interval_bounds diploid --dosage_filter 2 --plot all --plot_type pdf -partial include --min_distinct_haplotypes 2 --min_read_count 10 --min_haplotype_frequency 5 --processes 8

SMAP grm

Create and navigate to a new output directory to run SMAP grm on the haplotype call table of a set of individuals:

mkdir -p $download_dir/Real_data/SMAP_grm/output/ind
cd $download_dir/Real_data/SMAP_grm/output/ind
smap grm $download_dir/Real_data/SMAP_delineate/input/ind/ $download_dir/Real_data/SMAP_delineate/output/ind/final_stack_positions_48_ind_GBS-PE_C0_SMAP20_CL50_300.bed $download_dir/Real_data/SMAP_haplotype_sites/input/48_ind_GBS-PE.vcf --out haplotypes_48_ind_GBS-PE -mapping_orientation ignore --discrete_calls dosage --frequency_interval_bounds diploid --dosage_filter 2 --plot all --plot_type pdf -partial include --min_distinct_haplotypes 2 --min_read_count 10 --min_haplotype_frequency 5 --processes 8

Deactivate your virtual environment:

deactivate