.. raw:: html .. role:: purple .. raw:: html .. role:: white .. raw:: html .. role:: green .. role:: blue .. role:: red .. _SMAPtutorial_smap_design_potato: ################################################################### Design a CRISPR/Cas experiment with gRNA and HiPlex panels (potato) ################################################################### Activate your virtual environment :: source .venv/bin/activate Create the reference sequences ------------------------------ Define the path to the download directory:: download_dir=$/PATH/TO/DOWNLOAD_DIR/ Or, create a new download dir:: cd ../ mkdir download cd download download_dir=$(pwd) Download genomes and gene annotation data:: cd $download_dir mkdir originals cd originals wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/GeneFamilies/genefamily_data.HOMFAM.csv.gz wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/GeneFamilies/genefamily_data.ORTHOFAM.csv.gz wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/Genomes/stu.fasta.gz wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/GFF/stu/annotation.selected_transcript.all_features.stu.gff3.gz wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/GFF/stu/annotation.selected_transcript.exon_features.stu.gff3.gz wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/Annotation/annotation.selected_transcript.stu.csv.gz wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/IdConversion/id_conversion.stu.csv.gz wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/Descriptions/gene_description.stu.csv.gz wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/GO/go.stu.csv.gz wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/InterPro/interpro.stu.csv.gz wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/MapMan/mapman.stu.csv.gz wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/Fasta/proteome.selected_transcript.stu.fasta.gz wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/Fasta/transcripts.selected_transcript.stu.fasta.gz wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/Fasta/cds.selected_transcript.stu.fasta.gz Unzip and change some formats for **SMAP target-selection** and place in SMAP_reference directory:: mkdir ../../SMAP_reference zcat genefamily_data.HOMFAM.csv.gz | grep -v "#" > ../../SMAP_reference/genefamily_data.HOMFAM.csv zcat genefamily_data.ORTHOFAM.csv.gz | grep -v "#" > ../../SMAP_reference/genefamily_data.ORTHOFAM.csv zcat stu.fasta.gz > ../../SMAP_reference/stu.fasta zcat annotation.selected_transcript.stu.csv.gz > ../../SMAP_reference/annotation.selected_transcript.stu.csv zcat annotation.selected_transcript.all_features.stu.gff3.gz > ../../SMAP_reference/annotation.selected_transcript.all_features.stu.gff3 These files may be used later:: zcat annotation.selected_transcript.all_features.stu.gff3.gz > ../../SMAP_reference/annotation.selected_transcript.all_features.stu.gff3 zcat annotation.selected_transcript.exon_features.stu.gff3.gz > ../../SMAP_reference/annotation.selected_transcript.exon_features.stu.gff3 zcat gene_description.stu.csv.gz > ../../SMAP_reference/gene_description.stu.csv zcat go.stu.csv.gz > ../../SMAP_reference/go.stu.csv zcat interpro.stu.csv.gz > ../../SMAP_reference/interpro.stu.csv zcat mapman.stu.csv.gz > ../../SMAP_reference/mapman.stu.csv zcat proteome.selected_transcript.stu.fasta.gz > ../../SMAP_reference/proteome.selected_transcript.stu.fasta zcat transcripts.selected_transcript.stu.fasta.gz > ../../SMAP_reference/transcripts.selected_transcript.stu.fasta zcat cds.selected_transcript.stu.fasta.gz > ../../SMAP_reference/cds.selected_transcript.stu.fasta Create short_list of homology groups in the SMAP_reference directory:: cd ../../SMAP_reference echo "HOM05D000431" > shortlist_HOM.tsv Add as many different homology groups as you like:: echo "HOM05D000432" >> shortlist_HOM.tsv Create fasta reference files per homology group with **SMAP target-selection** (use ./ notation to indicate the path to this directory, otherwise the script tries to write files to the root of the system, where is has no write access rights).:: python3 ../smap/utilities/SMAP_target-selection.py ./annotation.selected_transcript.all_features.stu.gff3 ./stu.fasta ./genefamily_data.HOMFAM.csv stu --region 1000 --hom_groups ./shortlist_HOM.tsv Troubleshooting =============== In some cases the GFF file contains formatting errors in a subset of the genes, that cause problems when processing the GFF file with *gffutils*. In this case a quick workaround is to try to extract the gff features from the candidate genes only, and use that GFF file as input for **SMAP target-selection**. Create a short_list of candidate genes for the list of selected homology groups of the species of interest (the corresponding gene_IDs are listed in the third column of the file genefamily_data.HOMFAM.csv) :: grep -f shortlist_HOM.tsv ./genefamily_data.HOMFAM.csv | grep "stu" | cut -f3 > genelist_HOM05D000431_HOM05D000432.txt Extract the gff features for each of the candidate genes from the original GFF file :: grep -f genelist_HOM05D000431_HOM05D000432.txt ./annotation.selected_transcript.all_features.stu.gff3 > annotation.selected_transcript.all_features.HOM05D000431_HOM05D000432.stu.gff3 Run SMAP target selection again, using the short GFF file :: python3 ../smap/utilities/SMAP_target-selection.py ./annotation.selected_transcript.all_features.HOM05D000431_HOM05D000432.stu.gff3 ./stu.fasta ./genefamily_data.HOMFAM.csv stu --region 1000 --hom_groups ./shortlist_HOM.tsv Output ====== .. tabs:: .. tab:: log **SMAP target-selection** will report for how many homology groups it was not possible to extract any sequence of the given species. | ------------------------ | 2022-10-14 10:17 | ------------------------ | * Get sequences of genes in selected homology groups in fasta format... | * 0 homology group(s) are not present in ./genefamily_data.HOMFAM.csv for species "stu" and are listed in ./shortlist_HOM_not_present.txt | * Get features of genes in selected homology groups in gff format... | * 0 homology group(s) are not present in ./genefamily_data.HOMFAM.csv for species "stu" and are listed in ./shortlist_HOM_not_present.txt | * Finished | ------------------------ | 2022-10-14 10:18 | ------------------------ .. tab:: standard output By default, **SMAP target-selection** provides several types of output. .. tabs:: .. tab:: general output | SMAP target-selection creates separate files per homology group. | 1. A FASTA file with all extracted gene sequences encoded on the positive strand. | #. A GFF file containing the selected gene, exon, mRNA and CDS features. .. tab:: genome viewer | These output files can immediately be viewed by any genome browser, to check the integrity of the gene models prior to any amplicon or gRNA design. .. image:: ../images/tutorial/SMAP_target_selection_stu_selected_genes_extended_1000_bp.png .. tab:: FASTA file | Example of the FASTA file. .. csv-table:: :file: ../tables/tutorial/stu_selected_genes_extended_1000_bp.fasta :header-rows: 0 | Example of the FASTA file. .. tab:: GFF file | Example of the GFF file. .. csv-table:: :delim: tab :file: ../tables/tutorial/stu_selected_genes_extended_1000_bp.gff :header-rows: 0 ---- Identify gRNAs -------------- Install FlashFry:: mkdir ../SMAP_design cd ../SMAP_design mkdir FlashFry cd FlashFry wget https://github.com/mckennalab/FlashFry/releases/download/1.15/FlashFry-assembly-1.15.jar Concatenate all selected genes into one fasta file, and corresponding gff file:: cat ../../SMAP_reference/HOM05D000431_stu_selected_genes_extended_1000_bp.fasta ../../SMAP_reference/HOM05D000432_stu_selected_genes_extended_1000_bp.fasta > stu_selected_genes_extended_1000_bp.fasta cat ../../SMAP_reference/HOM05D000431_stu_selected_genes_extended_1000_bp.gff ../../SMAP_reference/HOM05D000432_stu_selected_genes_extended_1000_bp.gff > stu_selected_genes_extended_1000_bp.gff Create off-target database:: mkdir tmp java -Xmx4g -jar FlashFry-assembly-1.15.jar index -tmpLocation ./tmp -database stu_selected_genes_extended_1000_bp -reference stu_selected_genes_extended_1000_bp.fasta -enzyme spcas9ngg Discover gRNAs in reference sequences:: java -Xmx4g -jar FlashFry-assembly-1.15.jar discover --database stu_selected_genes_extended_1000_bp --fasta stu_selected_genes_extended_1000_bp.fasta --output stu_selected_genes_extended_1000_bp_guides.fasta.off_targets.tsv Create scores per gRNA:: java -Xmx4g -jar FlashFry-assembly-1.15.jar score --input stu_selected_genes_extended_1000_bp_guides.fasta.off_targets.tsv --output stu_selected_genes_extended_1000_bp_guides.fasta.off_targets.scores.tsv --scoringMetrics doench2014ontarget,doench2016cfd,hsu2013 --database stu_selected_genes_extended_1000_bp Output ====== .. tabs:: .. tab:: general output | FlashFry creates lists of gRNAs throughout the entire length of the reference sequences, and can predict specificity and activity scores. .. tab:: gRNA discovery | gRNA file .. csv-table:: Example of the **gRNA file** :delim: tab :file: ../tables/tutorial/stu_selected_genes_extended_1000_bp_guides.fasta.off_targets.tsv :header-rows: 1 .. tab:: gRNA scores | gRNA file with scores .. csv-table:: Example of the **gRNA file with scores** :delim: tab :file: ../tables/tutorial/stu_selected_genes_extended_1000_bp_guides.fasta.off_targets.scores.tsv :header-rows: 1 Create pairs of amplicons and gRNAs for all genes ------------------------------------------------- Run SMAP design:: cd ../ mkdir all_genes python3 ../smap-design/smap_design/SMAPdesign.py FlashFry/stu_selected_genes_extended_1000_bp.fasta FlashFry/stu_selected_genes_extended_1000_bp.gff -g FlashFry/stu_selected_genes_extended_1000_bp_guides.fasta.off_targets.scores.tsv -o stu_all_selected_genes_extended_1000_bp_output -v -gl -na 5 -al -smy -sf -aa -db mv stu_all_* all_genes/ Output ====== .. tabs:: .. tab:: log | 11:09:57 | -------------------------------------------------------- | Length Amplicons : 120-150 bp | Generate total amplicons : 150 | Generate non-overlapping amplicons: 5 | Generate gRNAs per amplicon: : 2 | -------------------------------------------------------- | 1/11 (PGSC0003DMG400023803) | 2/11 (PGSC0003DMG400025895) | 3/11 (PGSC0003DMG400023636) | 4/11 (PGSC0003DMG400030830) | 5/11 (PGSC0003DMG400023441) | 6/11 (PGSC0003DMG400026211) | 7/11 (PGSC0003DMG400016156) | 8/11 (PGSC0003DMG400017763) | 9/11 (PGSC0003DMG400010356) | 10/11 (PGSC0003DMG400013240) | 11/11 (PGSC0003DMG400029315) | Elapsed time: 00:04:51.17 | --------------------------------------------------------- .. tab:: standard output By default, **SMAP design** provides several types of output. .. tabs:: .. tab:: standard output tabular files: * a primer file (TSV file with the gene ID, primer ID and primer sequence). * a gRNA file (TSV file with the gene ID, gRNA ID and gRNA sequence). * a GFF file containing the selected primer and gRNA features (and all other features present in the genome annotation GFF file). .. tab:: primer file | primer file .. csv-table:: Example of the **primer file** :delim: tab :file: ../tables/tutorial/stu_all_selected_genes_extended_1000_bp_output_primers.tsv :widths: 20, 40, 40 .. tab:: gRNA file | gRNA file .. csv-table:: Example of the **gRNA file** :delim: tab :file: ../tables/tutorial/stu_all_selected_genes_extended_1000_bp_output_gRNAs.tsv :widths: 20, 40, 40 .. tab:: GFF file | GFF file .. csv-table:: Example of the **GFF file** :delim: tab :file: ../tables/tutorial/stu_all_selected_genes_extended_1000_bp_output.gff3 :header-rows: 0 .. tab:: genome viewer | These output files can immediately be viewed by any genome browser, to check the distribution along the length of the gene models prior to ordering any primers for HiPlex validation or cloning gRNAs into transient expression of binary vectors. | Example of the two gene families combined in the reference. .. image:: ../images/tutorial/SMAP_design_stu_all_selected_genes_extended_1000_bp.png | These output files can immediately be viewed by any genome browser, to check the distribution along the length of the gene models prior to ordering any primers for HiPlex validation or cloning gRNAs into transient expression of binary vectors. .. tab:: optional output | Optionally, **SMAP design** also provides: | * a summary file (total number of amplicons designed by Primer3, total number of gRNAs designed, number of gRNAs after filtering). | * a summary plot showing the number of amplicons and gRNAs that were designed per gene. | * a SMAP (BED) file that is needed as input for downstream analysis with **SMAP haplotype-sites**. | * a border (GFF) file that is needed as input for downstream analysis with **SMAP haplotype-window**. | * a gRNA (FASTA) file that is needed as input for downstream analysis with **SMAP haplotype-window**. | * a debug file (GFF file) containing all amplicons designed by Primer3 and all gRNAs from the input list before filtering. | * 'all-amplicons' files (GFF file, a primer and gRNA file) containing all amplicons with their respective gRNAs (not just the non-overlapping amplicons in the output files). .. tabs:: .. tab:: summary ``-smy`` .. tabs:: .. tab:: summary file | Summary file .. csv-table:: Example of the first rows of the **summary file** :delim: tab :file: ../tables/tutorial/stu_all_selected_genes_extended_1000_bp_output_summary.tsv :header-rows: 1 .. tab:: summary plot (complete design) | Summary plot .. image:: ../images/tutorial/stu_all_selected_genes_extended_1000_bp_output_summary_plot.png | This is an example of a summary plot that SMAP design generates for a run of 11 potato genes. The box at the top shows general info on the run. In this example, 5 non-overlapping amplicons were requested per gene each with a maximum of 2 gRNAs. In total, SMAP design generated 52 amplicons. | The top left bar graph (Non-overlapping amplicons per gene) shows the number of genes in function of the number of amplicons. In this case, for all 10 genes 5 amplicons were designed. For 1 gene, 2 amplicons were designed, and no genes had 0 or 1 amplicon. | The top right bar graph (gRNAs per gene) shows the number of genes in function of the number of gRNAs. In this case, for 3 genes 10 gRNAs were created (2 gRNAs * 5 amplicons per gene). for some other genes, between 7-9 guides were created. 1 gene had only 3 gRNAs. | The middle left graph (gRNAs per amplicon) shows the number of amplicons in function of the number of gRNAs. In this case, 40 of the amplicons that were designed covered 2 gRNAs, and 12 amplicons covered 1 gRNA. | The middle right graph shows the underlying reason for not retaining amplicons. Four cases are possible: 1) no gRNAs were designed for the gene or no gRNAs passed the filters; 2) no amplicons were designed by Primer3; 3) the gene is too short to design any amplicons of the desired length; 4) there is no overlap between the gRNAs and amplicons. | In this example, SMAP design was successful for all 11 genes, which is why the graph is empty. | The lower graph (Amplicons with and without gRNAs) shows the number of amplicons that were designed per gene. The grey bar shows the amplicons that were designed by Primer3 that did not overlap with any gRNAs (and are therefore discarded). The green bar shows the number of amplicons designed by Primer3 which did overlap with at least 1 gRNA. By adding both grey and green bars, the total number of amplicons designed by Primer3 per gene can be obtained. In this example a maximum of 150 amplicons was requested. The black points show the length of the gene in basepairs. .. tab:: SMAPfiles ``-sf`` .. tabs:: .. tab:: SMAP BED file | SMAP BED file .. csv-table:: Example of the **SMAP BED file** :delim: tab :file: ../tables/tutorial/stu_all_selected_genes_extended_1000_bp_output_SMAPs.bed :header-rows: 0 .. tab:: border GFF file | border GFF file .. csv-table:: Example of the **GFF file** :delim: tab :file: ../tables/tutorial/stu_all_selected_genes_extended_1000_bp_output_borders.gff3 :header-rows: 0 .. tab:: gRNA FASTA file | gRNA FASTA file .. csv-table:: Example of the **gRNA FASTA file** :delim: tab :file: ../tables/tutorial/stu_all_selected_genes_extended_1000_bp_output_gRNAs.fasta :header-rows: 0 .. tab:: allAmplicons ``-aa`` .. tabs:: .. tab:: allAmplicons border file | allAmplicons border file .. csv-table:: Example of the **allAmplicon border file** :delim: tab :file: ../tables/tutorial/stu_all_selected_genes_extended_1000_bp_output_allAmplicons_borders.gff3 :header-rows: 0 .. tab:: allAmplicons GFF3 file | allAmplicons GFF3 file .. csv-table:: Example of the **allAmplicon GFF3 file** :delim: tab :file: ../tables/tutorial/stu_all_selected_genes_extended_1000_bp_output_allAmplicons.gff3 :header-rows: 0 .. tab:: allAmplicons gRNA file | allAmplicons gRNA file .. csv-table:: Example of the **allAmplicon gRNA file** :delim: tab :file: ../tables/tutorial/stu_all_selected_genes_extended_1000_bp_output_allAmplicons_gRNAs.tsv :header-rows: 0 .. tab:: allAmplicons gRNA FASTA file | allAmplicons gRNA FASTA file .. csv-table:: Example of the **allAmplicon gRNA FASTA file** :file: ../tables/tutorial/stu_all_selected_genes_extended_1000_bp_output_allAmplicons_gRNAs.fasta :header-rows: 0 .. tab:: allAmplicons primers file | allAmplicons primers file .. csv-table:: Example of the **allAmplicon primers file** :delim: tab :file: ../tables/tutorial/stu_all_selected_genes_extended_1000_bp_output_allAmplicons_primers.tsv :header-rows: 0 .. tab:: allAmplicons SMAP BED file | allAmplicons SMAP BED file .. csv-table:: Example of the **allAmplicon SMAP BED file** :delim: tab :file: ../tables/tutorial/stu_all_selected_genes_extended_1000_bp_output_allAmplicons_SMAPs.bed :header-rows: 0 .. tab:: debug ``-db`` .. tabs:: .. tab:: debug file | debug file .. csv-table:: Example of the **debug file** :delim: tab :file: ../tables/tutorial/stu_all_selected_genes_extended_1000_bp_output_debug.gff3 :header-rows: 0 Create pairs of amplicons and gRNAs for a subset of genes --------------------------------------------------------- Create short_list of genes:: cd ../ echo "PGSC0003DMG400023803" > shortlist_target_genes.tsv Add as many different homology groups as you like:: echo "PGSC0003DMG400025895" >> shortlist_target_genes.tsv echo "PGSC0003DMG400023636" >> shortlist_target_genes.tsv Run SMAP design:: mkdir selected_genes python3 ../smap-design/smap_design/SMAPdesign.py FlashFry/stu_selected_genes_extended_1000_bp.fasta FlashFry/stu_selected_genes_extended_1000_bp.gff -g FlashFry/stu_selected_genes_extended_1000_bp_guides.fasta.off_targets.scores.tsv -o stu_selected_genes_extended_1000_bp_output -sg shortlist_target_genes.tsv -v -gl -na 5 -al -smy -sf -aa -db mv stu_selected_* selected_genes/ Output ====== **SMAP design** will report the parameter settings and for how many genes it was not possible to design any amplicons. The output is the same as described above, but limited to the candidate genes. | 10:51:26 | -------------------------------------------------------- | Length Amplicons : 120-150 bp | | Generate total amplicons : 150 | | Generate non-overlapping amplicons: 5 | | Generate gRNAs per amplicon: : 2 | | -------------------------------------------------------- | 1/3 (PGSC0003DMG400023803) | 2/3 (PGSC0003DMG400025895) | 3/3 (PGSC0003DMG400023636) | ../smap-design/smap_design/SMAPdesign.py:1596: UserWarning: FixedFormatter should only be used together with FixedLocator ax_ampliconGene.set_xticklabels(GeneLabel, | | Elapsed time: 00:01:21.27 | -------------------------------------------------------- .. _SMAPtutorial_smap_design_microalgae: ####################################################################### Design a CRISPR/Cas experiment with gRNA and HiPlex panels (microalgae) ####################################################################### Activate your virtual environment :: source .venv/bin/activate Create the reference sequences ------------------------------ Define the path to the download directory:: download_dir=$/PATH/TO/DOWNLOAD_DIR/ Or, create a new download dir:: cd ../ mkdir download cd download download_dir=$(pwd) Download genomes and gene annotation data:: cd $download_dir mkdir originals cd originals wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/GeneFamilies/genefamily_data.HOMFAM.csv.gz wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/GeneFamilies/genefamily_data.ORTHOFAM.csv.gz wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/Genomes/stu.fasta.gz wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/GFF/stu/annotation.selected_transcript.all_features.stu.gff3.gz wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/GFF/stu/annotation.selected_transcript.exon_features.stu.gff3.gz wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/Annotation/annotation.selected_transcript.stu.csv.gz wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/IdConversion/id_conversion.stu.csv.gz wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/Descriptions/gene_description.stu.csv.gz wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/GO/go.stu.csv.gz wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/InterPro/interpro.stu.csv.gz wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/MapMan/mapman.stu.csv.gz wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/Fasta/proteome.selected_transcript.stu.fasta.gz wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/Fasta/transcripts.selected_transcript.stu.fasta.gz wget https://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_05/Fasta/cds.selected_transcript.stu.fasta.gz ##################################################################################### Genetic diversity and composition analysis for wild and cultivated *Coffea canephora* ##################################################################################### In this tutorial we analyse population structure and composition and establish a core collection of a wild coffee population using molecular markers derived from GBS and HiPlex NGS read data. By the end of the tutorial you are able to employ some commonly used methods (PCA, fastSTRUCTURE, genetic distance matrix) to quantify genetic diversity and detect population structure. Additionally, you are able to use the Maximization strategy and distance method to establish a core collection. Characterization of the population structure and establishment of a core collection of wild *C. canephora* using GBS SNP markers -------------------------------------------------------------------------------------------------------------------------------- Sample set ========== To characterize the genetic structure of a wild coffee population, 256 coffee samples were collected over 24 forest inventory plots of 125 x 125m, covering an area of ca. 50-by-20km in the Yangambi region in the Democratic Republic of the Congo. The inventory plots were previously classified by `Depecker et al. (2022) `_, into three different forest categories: * 8 plots in regrowth forest * 7 plots in disturbed old-growth forest * 9 plots in undisturbed old-growth forest Used software ============= * `GBprocesS `_ * `Python (3.8.13) `_ * `SMAP package (4.6.5) `_ * Docker (v23.0.4) * GATK docker image (v3.7.0) * VCFtools (v0.1.16) * PLINK (v1.90) * fastSTRUCTURE (v1.0) * RStudio GBprocesS ========= Before we can map the read data to a reference genome, demultiplexing of the Genotyping-By-Sequencing libraries was done using GBprocesS. GBprocesS uses a config.ini file as the configuration syntax for demultiplexing, trimming, merging forward and reverse reads, base quality filtering, and removal of reads with internal restriction sites. The structure of the config.ini file and more explanation can be found in the `GBprocesS manual `_. Genomic DNA of the wild coffee samples was digested with restriction enzymes PstI and MseI. Due to the large sample set, library preparation was separated over two batches, whereby the sample-specific barcodes were reused, and each batch was amplified with a different set of i5 and i7 Illumina barcodes and added to Illumina flowcell binding sites during PCR. The cutsite remnants of the applied restriction enzymes can be found on `NEB `_ by looking up the RE recognition site. In this case the cutsite remnant of PstI is CTGCA and the cutsite remnant of MseI is TTNN. The barcoded and common adapters were based on Truseq sequencing primers, of which the sequence for trimming is stored in the GBprocesS code. * :download:`Example config.ini file <../../download_files/tutorial/Coffea_canephora/01_wild_coffee/01_GBprocesS/config.ini>` * :download:`Example barcode fasta file <../../download_files/tutorial/Coffea_canephora/01_wild_coffee/01_GBprocesS/barcode.fasta>` Mapping reads with BWA-MEM ========================== After demultiplexing, the merged reads can be directly mapped onto the reference genome sequence. Here, we use the publicly available `reference genome of Coffea canephora `_ (Denoeud et al., 2014). Reads of all samples were mapped onto the reference genome using Burrow-Wheeler Aligner Maximal Exact Match (`BWA-MEM `_) * :download:`BWA-MEM python script <../../download_files/tutorial/Coffea_canephora/01_wild_coffee/02_mapping/map_BWA_parallel.py>` .. code-block:: ruby source /path/to/python_virtual_environment/bin/activate python3 map_BWA_parallel.py -d /path/to/merged_reads_directory/ \ -s .fq \ /path/to/reference_genome_fasta_file \ -o /path/to/output_directory \ -p 16 \ -l 03_map_BWA_parallel_P1.log date SMAP delineate ============== The distribution of teh mapped GBS reads can be checked by running `SMAP delineate `_ on the BAM files of the entire sample set (256 samples). .. code-block:: ruby smap delineate /path/to/BAM \ -mapping_orientation ignore \ --processes 16 \ --plot all \ --plot_type png \ --min_stack_depth 4 \ --max_stack_depth 400 \ --min_cluster_depth 9 \ --max_cluster_depth 400 \ --completeness 80 SMAP delineate gives us insight into read mapping quality, locus length distribution, locus read depth distribution, the number of loci per sample (saturation curve) and locus completeness across the sample set. .. image:: ../images/tutorial/00_Coffea_canephora/01_wild_coffee/03_smap_delineate/StackCluster.Saturation.scatter.png 18 644 loci with completeness >80% SNP calling and filtering ========================= .. tabs:: .. tab:: SNP calling The Unified Genotyper tool of GATK was used to perform the SNP calling by comparing the mapped reads to the reference genome of *C. canephora*. We used the .bed file provided by SMAP delineate to only call SNPs within the 18644 loci with completeness higher than 80% across all 256 samples. .. code:: ruby cd /path/to/bamfiles_folder #Make a string variable that contains all input .bam files BAM_FILES=( "*".BWA.bam ) for i in "${BAM_FILES[@]}" do #make a variable STRING that contains all bam files, including the "-I" option that needs to precede the sample name in the GATK command STRING+='-I ' STRING+="$i " done #Run docker docker run -w /path/to/bamfiles_folder/ \ -v /path/in/host/to/link/:/path/in/container/to/link/ broadinstitute/gatk3:3.7-0 java \ -jar /usr/GenomeAnalysisTK.jar \ -R /path/to/reference_genome_fasta_file.fasta \ -T UnifiedGenotyper \ $STRING \ -o /path/to/output_directory/output_name.vcf \ -L /path/to/path_to_bed_file Output: 239.007 SNPs .. tab:: SNP filtering **SNP and sample quality filter** This command filtered on: * Minimum average depth per sample of 30 * Minimum allele count of 4 * Minimum SNP quality of 20 * Minimum genotype quality of 30 * Minimum depth per sample of 10 .. code:: ruby vcftools --vcf /path/to/output_SNPcalling.vcf --min-meanDP 30 --mac 4 --minQ 20 --minGQ 30.0 --minDP 10.0 --recode --out output_SNP_and_sample_quality_filtered.vcf **Polymorphic SNP filtering** The SelectVariants tool of GATK was used to retain polymorphic SNPs. .. code:: ruby docker run -v /path/in/host/to/link/:/path/in/container/to/link/ broadinstitute/gatk3:3.7-0 java \ -jar /usr/GenomeAnalysisTK.jar \ -T SelectVariants \ -R /path/to/reference_genome_fasta_file.fasta \ -V /path/to/SNP_and_sample_quality_filtered.vcf \ --excludeNonVariants \ -o output_polymorphic_SNPs_filtered.vcf **Biallelic SNP filtering** The SelectVariants tool of GATK was used to only retain biallelic SNPs. .. code:: ruby docker run -v /path/in/host/to/link/:/path/in/container/to/link/ broadinstitute/gatk3:3.7-0 java \ -jar /usr/GenomeAnalysisTK.jar \ -T SelectVariants \ -R /path/to/reference_genome_fasta_file.fasta \ -V /path/to/output_polymorphic_SNPs_filtered.vcf \ -restrictAllelesTo BIALLELIC \ -o output_biallelic_SNPs_filtered.vcf Output filtered SNPs: 3212 SNPs Warning: GATK will use the file name of the bam file which is defined within the bam file as sample name. By changing the name of the bam file name manually or via the command line, the sample will not be adapted in the .vcf file. Therefore, if you want to work with shorter or more informative names in further analyses, we recommend to change the samples after SNP calling in the obtained .vcf file using :download:`this customized python script <../../download_files/tutorial/Coffea_canephora/01_wild_coffee/04_SNP_calling_and_filtering/convert_vcf_header.py>` and an excel file containing the original .bam file names in one collumn and the preferred names in another column. Checking for genetically identical individuals ============================================== Before analyzing the population structure, we want to make sure that every individual is a genetically unique individual. Normally DNA of every individual was extracted and sequenced once, but possible labeling errors during plant maintenance, sampling, DNA extraction,library preparation, or sequencing, may lead to identical fingerprints across the sample set, or the identification of clonal materials in the collection. Therefore, we check if each indiviudal in the sample set is unique by calculating the pairwise Jaccard Inversed Distance (JID) between all pairs of individuals using `SMAP grm `_. Haplotype calling ~~~~~~~~~~~~~~~~~ SMAP grm uses haplotype calls of each individual to calculate the pairwise Jaccard Inversed Distance (JID). Therefore, read-backed haplotyping was conducted based on the combined variation in SMAPs (obtained by SMAP delineate) and SNPs (obtained by GATK) using the `SMAP haplotype-sites `_: module. .. code-block:: ruby smap haplotype-sites /path/to/BAM /path/to/bed /path/to/vcf \ -mapping_orientation ignore \ --partial include \ --processes 16 \ --plot_type png \ --no_indels \ --min_distinct_haplotypes 2 \ --min_read_count 10 \ --min_haplotype_frequency 2 \ --discrete_calls dosage \ --frequency_interval_bounds 10 10 90 90 \ --dosage_filter 2 Output: 22602 haplotypes within 18644 loci Genetic similarity ~~~~~~~~~~~~~~~~~~ SMAP grm code: .. code-block:: ruby smap grm --table /path/to/haplotype_table \ --input_directory /path/to/directory/of/haplotype/table \ --output_directory /path/to/output/directory/ \ --samples /path/to/sample_ID_file \ --processes 16 \ --plot_type png \ --locus_completeness 0.1 --similarity_coefficient Jaccard \ --locus_information_criterion Shared \ --partial --print_sample_information All \ --print_locus_information All \ --plot_format png \ Alternative: you can also calculate the JID for a subset of individuals. To do so, you only list the individuals of the subset in the Sample_ID.txt file specified by the parameter --samples. The order of the samples within the JID matrix will be specified by the order of the individuals listed in the Sample_ID.txt file. You can also give your individuals an alternative name in the JID matrix by adding a second column to the sample ID text file indicating the new names. First, we look at the distribution of the pairwise JID values to set a threshold to identify all pairs of genetically identical samples. .. image:: ../images/tutorial/00_Coffea_canephora/01_wild_coffee/04_Genetic_Similarity/JID_distribution.png Distribution of JID values across all pairwise comparisons revealed a group of samples with pairwise JID in the range 0.974–1, and a separate group of sample pairs with pairwise Jaccard Inversed Distance ranging between 0.64 and 0.89. A JID of 1 means identical haplotype calls at all detected loci (i.e., genetically identical). In practice, a Jaccard Inversed Distance in the range of 0.974 to 1, means that one HiPlex locus out of all XX detected loci in the sample pair may display a different haplotype constitution between two samples. (for instance a single instance of a false negative or false positive detection of a haplotype, due to technical errors). Therefore, the minimal JID of 0.974 (for GBS data) was used as a threshold to identify all pairs of genetically identical samples (i.e., clones). Second, genetically identical samples were identified using the minimal JID values of 0.974. .. image:: ../images/tutorial/00_Coffea_canephora/01_wild_coffee/04_Genetic_Similarity/JID_matrix.png Seven pairs of samples (indicated in red) had a JID higher than 0.974 and were therefore identified as genetically identical samples. As *Coffea canephora* is a self-incompatible species we did not expect genetically identical samples in our sample set. By looking further at the genetically identical sample pairs, we discovered some repeats that indicated errors had been made during the creation of the GBS libraries in the lab. To prevent bias from these identical pairs in the population structure analyses, one individual of each identical sample pair (7 in total) was removed from the sample set. Investigating the genetic structure and composition =================================================== Principal component analysis ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A principal component analyses performed on the filtered 3212 SNPs for the remaining 249 wild *C. canephora* samples, will give us a first impression of the population structure within the wild coffee population. An annotation file is needed to annotate the samples with more information, such as sample ID, sampling location, location information, etc. The order of the samples in the annotation file needs to be the same as in the .vcf file .. tabs:: .. tab:: PCA R-script * :download:`PCA R-script <../../download_files/tutorial/Coffea_canephora/01_wild_coffee/05_PCA/pca.R>` * :download:`Example anotation script <../../download_files/tutorial/Coffea_canephora/01_wild_coffee/05_PCA/Annotations.txt>` .. code-block:: r ``` R script for PCA with SNP data ``` # *Install and load required packages* library(vcfR) library(adegenet) library(ape) library(tidyverse) # *Set working directory* setwd("/path/to/working_directory/") getwd() # *Read vcf file* vcf <- read.vcfR("path/to/.vcf") ##Check vcf file head(vcf) vcf@fix[1:10,1:5] ## Convert vcf to Genlight object* gen <- vcfR2genlight(vcf) ## Add real SNP names locNames(gen) <- paste(vcf@fix[,1],vcf@fix[,2],sep="_") # *Perform a PCA* pca <- glPca(gen, nf=100) #principal component analysis with 100 PCs plot(pca$scores) pca_scores <- as.data.frame(pca$scores) ## Check the proportion of variance explained by the first three PC axes pca$eig[1]/sum(pca$eig) pca$eig[2]/sum(pca$eig) pca$eig[3]/sum(pca$eig) # Create a dataframe with the annotation file and PCA scores ## Import annotation file annotations <- read.delim("Annotations.txt", sep = "\t", header = TRUE) ### WARNING: The order and names of the individuals must be the same as in the VCF file. ## Create a new dataframe containing a first column wiht Sample_IDs and other columns containing the PCA scores pca_annotated <- cbind(annotations, pca$scores[,c(1:3)]) # Plot the PCA pca_plot <- ggplot(pca_annotated,aes(x = PC1,y = PC2, color = Plot_nr, shape = Forest_disturbance))+ geom_point(size=2.5)+ #scale_color_manual(values = c("blue", "red", "green"))+ ggtitle("Wild C. canephora population")+ xlab("PC1 (5.1%)")+ ylab("PC2 (3.2%)")+ theme_bw() pca_plot write.table(pca_annotated, "pca_scores.txt", sep="\t", row.names = FALSE) png("output_file_name.png", width = 900, height = 600) pca_plot dev.off() .. image:: ../images/tutorial/00_Coffea_canephora/01_wild_coffee/05_PCA/pca_plot.png Three different clusters could be identified using the PCA analyses. All samples collected in plot 3 clustered together on the positive PC1-axis and all samples collected in plot 1 and 2 clustered together on the negative PC2-axis. All other samples (plot 4-25) clustered together on the negative PC1-PC2-axis. As the samples are clustered together based on their geographical location, rather than forest disturbance, we have an indication that the geographical location could be a main driver of genetic structure in this wild coffee population. To support this hypothesize, we will analyze the genetic structure using the fastSTRUCTURE analysis. fastSTRUCTURE ~~~~~~~~~~~~~ To visualize the genetic structure and genetic composition of the wild coffee population, fastSTRUCTURE was run using a PLINK formatted genotype call file. .. code:: ruby vcftools --vcf /path/to/SNP_filtered_vcf_file.vcf --plink --out output_file_name Converting .vcf file to .plink format creates a .map and .ped file. The .map file contains information about the SNP: Chromosome name, SNP name, Position in Morgans or centiMorgans (optional) and basepair coordinates. However, when our .vcf file was converted to the .map file, the column with the Chromosome names consisted of zeros instead of the chromosome numbers. Without adjusting the first column the fastSTRUCTURE analyses will give an error. Therefore, the chromosome numbers were manually added to the .map file based on the SNP name. * :download:`map file <../../download_files/tutorial/Coffea_canephora/01_wild_coffee/06_fastSTRUCTURE/wild_coffee.map>` The .ped file contains information about the sample set: Family ID, Individual ID, Father ID, Mother ID, Sex, Phenotype, Rest of the columns: Genotypes for each SNP. However, when our .vcf file was converted to the .ped file, the column which should contain the family ID contained also the individual ID. This is because the .vcf file does not contain any population or sample information besides the sample ID. Therefore, the plot information was manually added to the .ped file based on the annotation file used for the principal component analyses. WARNING: make sure there are no spaces in the family name otherwise plink will read this as too many tokens. * :download:`map file <../../download_files/tutorial/Coffea_canephora/01_wild_coffee/06_fastSTRUCTURE/wild_coffee.map>` Next, the .ped and .map file are converted to a .bed file .. code:: ruby plink --file /path/to/vcf_to_plink_output --make-bed --out output_file_name Using this .bed file, we can run the fastSTRUCTURE analysis. As we currently not know how many clusters are present in our sample set we will loop the analyses for 2 to 9 clusters. .. code:: ruby conda activate faststructure for i in `seq 2 9`; do structure.py --input=/path/to/plink_bed_file(no_suffix) --output=output_file_name -K $i --full --seed=100; done For each cluster, 5 output files are created: * .log :download:`example log file (K=4) <../../download_files/tutorial/Coffea_canephora/01_wild_coffee/06_fastSTRUCTURE/wild_coffee.4.log>` * meanQ: mean ancestry proportion per sample :download:`example meanQ (K=4) <../../download_files/tutorial/Coffea_canephora/01_wild_coffee/06_fastSTRUCTURE/wild_coffee.4.meanQ>` * meanP: mean error rate of proportion per sample :download:`example meanP (K=4) <../../download_files/tutorial/Coffea_canephora/01_wild_coffee/06_fastSTRUCTURE/wild_coffee.4.meanP>` * varQ: variance of mean ancestry proportion per sample :download:`example varQ (K=4) <../../download_files/tutorial/Coffea_canephora/01_wild_coffee/06_fastSTRUCTURE/wild_coffee.4.varQ>` * varP: variance of mean error rate of proportion per sample :download:`example varP (K=4) <../../download_files/tutorial/Coffea_canephora/01_wild_coffee/06_fastSTRUCTURE/wild_coffee.4.varP>` Note: Each line in the output file corresponds with a sample but the sample name is not given in the output. The order of the samples in the output files is the same as the order in the vcf and plink files. Each column in the output file corresponds to a cluster. The chooseK.py script was used to get the optimal number of clusters, which is printed to the terminal. .. code:: ruby conda activate faststructure chooseK.py --input *path/to/output_name_fastSTRUCTURE(no_suffix)* output: Model complexity that maximizes marginal likelihood = 4 The optimal number of clusters found in the wild coffee population was 4. A new file was manually made were the meanQ results belonging to the K = 4 were complemented with a sample ID, plot number and forest disturbance info so that the fastSTRUCTURE result can be visualized in RStudio. .. tabs:: .. tab:: visualizing fastSTRUCTURE * :download:`input faststructure file <../../download_files/tutorial/Coffea_canephora/01_wild_coffee/06_fastSTRUCTURE/R_plot_faststructure.R>` * :download:`visualizing fastSTRUCTURE using R-script <../../download_files/tutorial/Coffea_canephora/01_wild_coffee/06_fastSTRUCTURE/R_plot_faststructure.R>` .. code-block:: r ``` R script to visualize fastSTRUCTURE results ``` #Install and load required packages library(adegenet) library(tidyverse) #Set working directory setwd("/path/to/working/directory") getwd() #Read manually made faststructure file fs <- read_delim("R_plot_faststructure.txt", delim = "\t", col_names = TRUE) #Making the barplot ggplot(fs, aes(fill=Cluster, y=Value, x=Number)) + geom_bar(position="stack", stat="identity", width = 1) + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(), axis.text.y=element_blank(), axis.ticks.y=element_blank(), axis.title.x = element_blank(), axis.title.y = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank())+ facet_grid(~ Forest, scales = "free_x", space = "free_x", switch = "x") .. image:: ../images/tutorial/00_Coffea_canephora/01_wild_coffee/06_fastSTRUCTURE/fastSTRUCTURE_K=4.png The fastSTRUCTURE analysis showed that the plots in undisturbed old-growth forest were divided into two subpopulations, namely plot 16–19 and plot 21–25, with individuals in plots 16–19 showing a mixture of DAPC cluster 1 and 4. Plots in disturbed old-growth forest were divided into four subpopulations, namely: plots 1 and 2; plot 3; plots 5 and 15; plots 11 and 12. Plots in regrowth forest were divided into three subpopulations, namely: plot 4; plot 10; plots 6–9, 13 and 14. Individuals in plot 10 showed to be a mixture of clusters 1 and 4, whereas individuals in plot 4 showed a mixture of all DAPC clusters. Overall, clustering of the samples from old-growth forest was differentiated according to the geographical location of the plots. Isolation-by-distance ~~~~~~~~~~~~~~~~~~~~~ The PCA and fastSTRUCTURE showed genetic structure within the wild coffee population mainly driven by geographical location. To see if this could be a result of isolation-by-distance (IBD), we will calculate genetic distances between the plots, pairwise Fst values (Weir and Cockerham, 1984), using PLINK 1.9 and correlated these genetic distances to the geographical distances between the plots. Therefore, we will use the plink files (also used for fastSTRUCTURE analysis), a list of all plots and a python script: .. tabs:: .. tab:: pairwise FST * :download:`list of all plots <../../download_files/tutorial/Coffea_canephora/01_wild_coffee/07_IBD/list_plots.txt>` * :download:`python script to calculate pairwise fst <../../download_files/tutorial/Coffea_canephora/01_wild_coffee/07_IBD/pairwise_fst_plink.py>` .. code-block:: ruby popfile = "list_lots.txt" #List of populations - one per line plinkfile = "wild_coffee" #Just the bit before the .bim etc. import os #Add pop names to a list dd = open(popfile) pops = [] for line in dd: pops.append(line.strip()) dd.close() output = open("pairwise_fst.txt","w") #Now do a pairwise loop over populations for i in range(0,len(pops)-1): for j in range(i+1,(len(pops))): temp = open("temp.txt","w") temp.write(pops[i] + ("\n") + pops[j]) temp.close() os.system("plink --bfile " + plinkfile + " --fst --family --keep-fam temp.txt --out temp --autosome-num 37") cd = open("temp.log") for line in cd: if line.startswith("Mean Fst"): output.write(pops[i] + ("\t") + pops[j] + ("\t") + line.split()[3] + ("\n")) break os.system("rm temp.*") output.close Output: a text file with the pairwise Fst value (column 3) between the first plot (column 1) and second plot (column 2) To visualize the comparison between plots from different forest disturbance levels (undisturbed, disturbed, and regrowth) an extra column will be manually added to the text file indicating the level of disturbance per plot. .. tab:: Isolation-by-distance * :download:`input pairwise Fst file <../../download_files/tutorial/Coffea_canephora/01_wild_coffee/07_IBD/list_plots.txt>` * :download:`input latitude and longitude of each plot <../../download_files/tutorial/Coffea_canephora/01_wild_coffee/07_IBD/latlong_no20.txt>` * :download:`IBD R-script <../../download_files/tutorial/Coffea_canephora/01_wild_coffee/07_IBD/IBD.R>` ..code-block:: r ``` R script to visualize IBD ``` #Install and load required packages library(tidyverse) library(ggplot2) library(ggthemes) library(geosphere) #Set working directory setwd("/path/to/working/directory") #Read pairwise Fst file pfst <- read_delim("pairwise_fst.txt", delim = "\t",col_names = T) # read in pairwise Fst file pfst colnames(pfst) <- c("group", "p1","p2","FST") # give the columns a name pfst ll <- read_delim("latlong_no20.txt", delim = "\t") ll #Calculate pairwise distance between each of our populations ##Create an empty distance column pfst$dist <- NA pfst ##Loop over the pfst data frame for(i in 1:nrow(pfst)) { ###Extract population names from pd data frame pop1 <- pfst$p1[i] #First element of p1 column pop2 <- pfst$p2[i] #First element of p2 column ###Get lat and long information l1 <- subset(ll,Population == pop1) l2 <- subset(ll,Population == pop2) ###Now calculate distance pfst$dist[i] <- distGeo(p1 = cbind(l1$Long,l1$Lat), p2 = cbind(l2$Long,l2$Lat)) } ##Check that it worked head(pfst) ## Save table write.csv(pd, file="pd.csv") #Plot pairwise Fst vs geographical distance between pairs of plots ggplot(pfst,aes(x = dist/1000,y = FST, color = pfst$group))+ #divide by 1000 to get distance in km geom_point(aes(size = 0.5))+ scale_color_manual(values = c("#990000","#FFCC99","#FF6600","#333999", "#99CCFF", "#CCCCFF"))+ theme_bw()+ xlab("Distance (km)")+ ylab(expression("F"[ST]))+ geom_abline()+ geom_smooth(method = lm, se =FALSE, colour = "black")+ xlim(NA,45)+ ylim(0,0.14)+ guides(color=guide_legend(title="Forest disturbance"), size="none") .. image:: ../images/tutorial/00_Coffea_canephora/01_wild_coffee/06_fastSTRUCTURE/fastSTRUCTURE_K=4.png Isolation-by-distance was found across all plots explaining the genetic structure found in the wild coffee population. Establishing a core collection ============================== The wild coffee population in the Yangambi region harbours a high degree of genetic diversity and therefore is a very important genetic resource. To ensure that this high genetic diversity will be preserved, we could opt to create a core collection of this wild population. Genetic data, phenotypical data or agricultural traits can all be taken into account to select the best samples for the core collection, but for this wild coffee population, only the genetic data is currently available. Core Hunter 3 v3.2.0 R package (De Beukelaer et al., 2018) was used to test two core collection strategies for the 248 samples of the wild coffee population: * Maximization strategy (CC-I): maximizes the genetic diversity by selecting the most diverse loci * Genetic distance method (CC-X): maximizes the genetic distance between the entries of the core collection and selecting the most diverse plant material within the wild coffee population For both core collection strategies, allele coverage (CV), diversity within and between alleles [He and Shannon’s index (SH)], and average genetic Modified Roger’s (MR) distance between entry-accession (AN) and entry-to-nearest-entry (EN) were calculated for nine different core sizes (3, 5, 10, 25, 50, 100, 150, 200, and 249 accessions). The filtered .vcf file will be manually converted to the format needed for CoreHunter. The genotype calls (0/0, 0/1 and 1/1) will be transformed into diploid dosage (0,1,2) and the SNPs rows will be transposed with the sample columns whereby the columns contain data per SNP and the rows contain data per sample. Additionally, if a sample had no genotype call for a specific SNP (./.) this cell will be left empty. .. tabs:: .. tab:: R script creating core collections * :download:`Example CoreHunter input file <../../download_files/tutorial/Coffea_canephora/01_wild_coffee/08_core_collection/Input_CoreHunter.txt>` * :download:`R-script CoreHunter <../../download_files/tutorial/Coffea_canephora/01_wild_coffee/08_core_collection/tutorial_corehunter.R>` ..code-block:: r ``` R script to visualize IBD ``` # Install and load required packages library(rJava) library(corehunter) library(ggplot2) library(patchwork) library(cowplot) library(tidyverse) # *Import genotype data file* genoo <- genotypes(file = "Input_CoreHunter.txt", format = "biparental") # *Create CoreHunter data file* my.data <- coreHunterData(genoo) my.data # *Test Maximization strategy (CC-I)* ## Create empty dataframe to add diversity and distance measures score_CCI <- data.frame("Size" = numeric(0), "Average entry-to-nearest-entry distance" = numeric(0), "Average accession-to-nearest-entry distance" = numeric(0), "Shannon's Index" = numeric(0), "Expected heterozygosity"=numeric(0), "Allele coverage"= numeric(0)) for (i in seq(2,(length(genoo$ids)-1),10)) { ## Create core collection with desired size core <- sampleCore(my.data, objective("AN", "MR"), size = i, steps = 500) ## Calculate diversity and distance measures within the core collection EN_MR <- evaluateCore(core, my.data, objective("EN", "MR")) AN_MR <- evaluateCore(core, my.data, objective("AN", "MR")) CV <- evaluateCore(core, my.data, objective("CV")) SH <- evaluateCore(core, my.data, objective("SH")) HE<- evaluateCore(core, my.data, objective("HE")) score_CCI[nrow(score_CCI)+1,] <- data.frame('Size' = i, "Average entry-to-nearest-entry distance" = EN_MR, "Average accession-to-nearest-entry distance" = AN_MR, "Expected heterozygosity" = HE, "Shannon's Index" = SH, "Allele coverage" = CV) } # *Test distance method (CC-X)* ## Create empty dataframe to add diversity and distance measures score_CCX <- data.frame("Size" = numeric(0), "Average entry-to-nearest-entry distance" = numeric(0), "Average accession-to-nearest-entry distance" = numeric(0), "Shannon's Index" = numeric(0), "Expected heterozygosity"=numeric(0), "Allele coverage"= numeric(0)) for (i in seq(2,(length(genoo$ids)-1),10)) { ## Create core collection with desired size core <- sampleCore(my.data, objective("EN", "MR"), size = i, steps = 500) ## Calculate diversity and distance measures within the core collection EN_MR <- evaluateCore(core, my.data, objective("EN", "MR")) AN_MR <- evaluateCore(core, my.data, objective("AN", "MR")) CV <- evaluateCore(core, my.data, objective("CV")) SH <- evaluateCore(core, my.data, objective("SH")) HE <- evaluateCore(core, my.data, objective("HE")) score_CCX[nrow(score_CCX)+1,] <- data.frame('Size' = i, "Average entry-to-nearest-entry distance" = EN_MR, "Average accession-to-nearest-entry distance" = AN_MR, "Expected heterozygosity" = HE, "Shannon's Index" = SH, "Allele coverage" = CV)} # *Evaluation of the diversity and distance measures* color_list <- c("#339999", "#006699", "#FF9966", "#CC6666", "#66CC66" ) measure_list <- c("Average entry-to-nearest-entry distance", "Average accession-to-nearest-entry distance", "Expected heterozygosity", "Shannon's Index", "Allele coverage") plots <- list() count <- 1 ## Create plots for all calculated diversity and distance measures ### WARNING: this codes needs to be run twice: once for CC-I and once for CC-X for (measure in colnames(score_CCX[2:6])) { subtitle <- measure_list[count] plot <- ggplot(score_CCX, aes(x=Size, y= .data[[measure]]))+ geom_point(colour = color_list[count])+ geom_line(colour = color_list[count])+ scale_x_continuous(breaks = seq(min(score_CCX$Size), max(score_CCX$Size), by = 20))+ ggtitle(measure_list[count])+ theme_bw()+ theme(legend.position = "none", plot.title = element_text(color = color_list[count], size = 12, hjust = 0.5), axis.title.y = element_blank()) plots[[measure]] <- plot count <- count + 1 } combined_plot <- wrap_plots(plots, ncol=2)+ plot_annotation(title = "Core collection CC-X", theme = theme(plot.title = element_text(hjust=0.5))) combined_plot # *Create core collection with optimized core size* ## Core collection CC-I CCI <- sampleCore(my.data, objective("AN", "MR"), size = 122) CCI_samples <- data.frame('Sample_ID' = CCI$sel) ##save sample ids of the selected samples ## Core collection CC-X CCX <- sampleCore(my.data, objective("EN", "MR"), size = 22) CCX_samples <- data.frame('Sample_ID' = CCX$sel) ##save sample ids of the selected samples # *Visualization of core collection samples using PCA* ## Here we will use the pca values, which were already calculated in a previous analysis ## Import pca scores pca <- read_delim("pca_scores.txt", delim = "\t", col_names = TRUE) ## Create a column in a new dataframe with the information if the individual is selected as ## an entry for core collection CCI, CCX or both cc_info <- data.frame('Sample_ID' = as.character(pca$Sample_ID)) cc_info <- Collection %>% mutate(Collection = case_when( cc_info$Sample_ID %in% CCI_samples$Sample_ID & !(cc_info$Sample_ID %in% CCX_samples$Sample_ID) ~"CC-I", cc_info$Sample_ID %in% CCX_samples$Sample_ID & !(cc_info$Sample_ID %in% CCI_samples$Sample_ID) ~"CC-X", cc_info$Sample_ID %in% CCI_samples$Sample_ID & cc_info$Sample_ID %in% CCX_samples$Sample_ID ~"CC-I and CC-X", TRUE ~ "Not in core collections" )) ## Merge the new dataframe with the pca values pca_info <- merge(cc_info, pca, by = "Sample_ID") ## Create plot pca_plot <- ggplot(pca_info, aes(PC1, PC2, color = Collection))+ geom_point(alpha = 0.4, size = 3)+ xlab("PC1 ()")+ ylab("PC2 ()")+ ggtitle("PCA core collections of wild coffee population")+ guides()+ theme_bw() pca_plot .. tab:: Evaluation plots **CC-I** For CC-I, an optimal core size was determined based on maximized genetic diversity (He and SH) and minimized AN distance. .. image:: ../images/tutorial/00_Coffea_canephora/01_wild_coffee/08_core_collection/Evaluation_CCI.png The optimal core size comprised 100 unique genetic fingerprints as the genetic diversity was higher than other core sizes (He of 0.20 and SH of 5.75), all alleles were accounted for (CV of 1) and entry-to-accession distance was low (AN of 0.13) **CC-X** For CC-X, an optimal core size was determined based on maximized genetic diversity (He and SH) and maximized EN distance. .. image:: ../images/tutorial/00_Coffea_canephora/01_wild_coffee/08_core_collection/Evaluation_CCX.png The optimal core size comprised 10 unique genetic fingerprints as the genetic diversity (He of 0.23 and SH of 5.79) was higher than other core sizes tested for CC-X, almost all alleles were accounted for (CV of 0.93) and entry-to-nearest-entry distance was high (EN = 0.35). Using the values of the PCA, which we had created earlier, we can show which of the samples were selected for the core collections CC-I and CC-X .. image:: ../images/tutorial/00_Coffea_canephora/01_wild_coffee/08_core_collection/pca_core_collection_plot.png On the plot we can see that accessions assigned to the CC-I and CC-X core collection were evenly distributed across the ordination space of the wild coffee population. Characterization of the genetic composition and establishment of a core collection of cultivated *C. canephora* using HiPlex SNP markers ---------------------------------------------------------------------------------------------------------------------------------------- Purpose ======= The genetic background of a *Coffea canephora* cultivar collection grown in the Congo Basin was analysed to determine the different genetic sources of the crop. HiPlex sequencing using 86 loci was performed on 270 *C. canephora* samples and 9 samples were genotyped using GBS. In addition, HiPlex sequencing with the same loci was performed on 48 samples of local wild *C. canephora* and 35 samples of a variety of coffee species. At last, whole genome shotgun (WGS) sequencing data of two reference samples of the Congolese subgroup A was retrieved from SRA (`Bioproject PRJNA803612 `_). After sequencing, the following workflow was executed to visualize the population structure of coffee in the Congo Basin: * Reads were mapped to the `reference genome `_ of *C. canephora* using Burrow-Wheeler Aligner Maximal Exact Match (`BWA-MEM `_) * :ref:`Delineate` * :ref:`SNP` with GATK and BCFtools * Determining haplotypes by executing :ref:`Haplotype` * Analysis of the population structure in four different ways: * First indication of the genetic structure using principal component analysis (:ref:`PCA`) * Genetic clusters of the samples set with :ref:`fastStruc` * Genetic distances between samples using SMAP genetic relationship matrices (:ref:`grm`) * Creation of a :ref:`phylo` * Creating a :ref:`core` Installation of software ======================== Downloading the specific software versions used in this tutorial will ensure all displayed commands operate properly. The latest software versions can also be downloaded. .. tabs:: .. tab:: Specific versions * `Python (3.8.13) `_ * `SMAP package (4.6.5) `_ * Using wget and virtual environment: .. code-block:: ruby wget https://gitlab.com/truttink/smap/-/archive/4.6.5/smap-4.6.5.tar.bz2 tar -xvf smap-4.6.5.tar.bz2 cd smap-4.6.5/ python3 -m venv .venv source .venv/bin/activate pip install --upgrade pip pip install . #Test if installation was successful: smap delineate --version #Output: (date) root - INFO: This is SMAP 4.6.5 (date) Delineate - INFO: SMAP-delineate started. 4.6.5 * `QualiMap (2.2.1) `_ .. code-block:: ruby wget https://bitbucket.org/kokonech/qualimap/downloads/qualimap_v2.2.1.zip unzip qualimap_v2.2.1.zip cd qualimap_v2.2.1 #Test if installation was successful: ./qualimap multi-bamqc #Output (first part): Java memory size is set to 1200M Launching application... QualiMap v.2.2.1 Built on 2016-10-03 18:14 Selected tool: multi-bamqc #Possible error when using newer Java versions: Java memory size is set to 1200M Launching application... Unrecognized VM option 'MaxPermSize=1024m' Error: Could not create the Java Virtual Machine. Error: A fatal exception has occurred. Program will exit. #Fix by editing qualimap file: nano ./qualimap #Change -XX:MaxPermSize=1024m to new java option: -XX:MaxMetaspaceSize=1024m #Test installation again * `Docker (23.0.4) `_ * `BCFtools (1.13) `_ * Install locally: .. code-block:: ruby wget https://github.com/samtools/bcftools/releases/download/1.13/bcftools-1.13.tar.bz2 tar -xvf bcftools-1.13.tar.bz2 cd bcftools-1.13/ ./configure #if error => install dependencies depending on your OS (packages stated in terminal) #=> ./configure make sudo make install #Test if installation was successful: bcftools --version #Output: bcftools 1.13 Using htslib 1.13 Copyright (C) 2021 Genome Research Ltd. License Expat: The MIT/Expat license This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. * Or install `BCFtools docker image `_: .. code-block:: ruby docker pull staphb/bcftools:1.13 #Test if installation was successful: docker run staphb/bcftools:1.13 bcftools --version #Output: bcftools 1.13 Using htslib 1.13 Copyright (C) 2021 Genome Research Ltd. License Expat: The MIT/Expat license This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. * `GATK (3.7-0) docker image `_ .. code-block:: ruby docker pull broadinstitute/gatk3:3.7-0 #Test if installation was successful: docker run broadinstitute/gatk3:3.7-0 java -jar /usr/GenomeAnalysisTK.jar --version #Output: 3.7-0-gcfedb67 * `VCFtools (0.1.16) `_ * Install locally: .. code-block:: ruby wget https://github.com/vcftools/vcftools/releases/download/v0.1.16/vcftools-0.1.16.tar.gz tar -xvzf vcftools-0.1.16.tar.gz cd vcftools-0.1.16/ ./configure #if error => install dependencies depending on your OS (packages stated in terminal) #=> ./configure make sudo make install #Test if installation was successful: vcftools --version #Output: VCFtools (0.1.16) * Or install `VCFtools docker image `_: .. code-block:: ruby docker pull biocontainers/vcftools:v0.1.16-1-deb_cv1 #Test if installation was successful: docker run biocontainers/vcftools:v0.1.16-1-deb_cv1 vcftools --version #Output: VCFtools (0.1.16) * `PLINK (version 1.90b6.18) `_ * Install locally: .. code-block:: ruby mkdir PLINK1.19 cd PLINK1.19 wget http://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20200616.zip unzip plink_linux_x86_64_20200616.zip #Move to a directory in PATH environment variable #for command line tool execution (e.g. /usr/bin) sudo mv plink /usr/bin/ #Test if installation was successful: plink --version #Output PLINK v1.90b6.18 64-bit (16 Jun 2020) * Or install `PLINK docker image `_: .. code-block:: ruby docker pull elixircloud/plink:1.9-cwl #Test if installation was successful: docker run elixircloud/plink:1.9-cwl plink --version #Output: PLINK v1.90p 64-bit (16 Jun 2020) * `fastSTRUCTURE 1.0 `_ * First install `conda `_ * Add `Bioconda channel `_ * Then install the `fastSTRUCTURE conda environment `_: .. code-block:: ruby conda create --name faststructure python=2.7 conda activate faststructure conda install -c bioconda faststructure #Test if installation was successful: chooseK.py #Output: Here is how you can use this script Usage: python /home/guest/miniconda3/envs/fastStructure/bin/chooseK.py --input= * `RStudio `_ * `renv package `_ is used for version control .. tab:: Latest versions * `Python `_ * `SMAP package `_ * `QualiMap `_ * `Docker `_ * `BCFtools `_ * `Latest GATK docker image `_ .. code-block:: ruby docker pull broadinstitute/gatk:latest * `VCFtools `_ * `PLINK `_ * `fastSTRUCTURE `_ * `RStudio `_ .. _Delineate: Read mapping quality control ============================ The method for read mapping quality control (QC) depends on the NGS method. Mapping quality can be evaluated with QualiMap for WGS, GBS and HiPlex reads. SMAP provides a more specific means of QC for mapped GBS reads: `SMAP delineate `_. .. tabs:: .. tab:: SMAP delineate SMAP delineate gives us insight into locus completeness and sequencing read depth across all samples. This is important for our workflow because no variants can be identified in samples where no reads are mapped in polymorphic regions. .. code-block:: ruby smap delineate *path_to_GBS_bam_files* -mapping_orientation ignore -p 8 --plot all --plot_type png -f 50 -g 300 --min_stack_depth 2 --max_stack_depth 500 --min_cluster_depth 10 --max_cluster_depth 1500 --max_stack_number 2 --min_stack_depth_fraction 10 --completeness 1 --max_smap_number 10 Parameters were set for double-enzyme GBS with merged reads in diploid individuals as described in `SMAP delineate examples `_. The quality of the mapped reads is depicted by the MergedCluster barplots and the saturation curve scatterplot. .. tabs:: .. tab:: MergedCluster barplots **Median read depth per Mergedcluster** .. image:: ../images/tutorial/00_Coffea_canephora/02_cultivated_coffee/00_SMAP_delineate/final_stack_positions.MergedCluster.MedianRD.histogram.png The median read depth per MergedCluster barplot depicts how many MergedClusters are shared between at least half of the samples at a given depth. The completeness of the mapped reads can be determined by comparing the individual StackCluster read depth barplots to this one. All of our GBS samples have very similar read mapping distributions. **Mergedcluster completeness across samples** .. image:: ../images/tutorial/00_Coffea_canephora/02_cultivated_coffee/00_SMAP_delineate/final_stack_positions.MergedCluster.Completeness.histogram.png This barplot shows the distribution of MergedClusters completeness across all samples. A MergedCluster is plotted at x-value 1 if that MergedCluster only has a sufficient read depth in one sample. In contrast, MergedClusters with sufficient read depth across all samples are plotted at the maximal x-value (number of samples in the sample set). Around 14000 MergedClusters are shared across all our samples because they have sufficient read depth in all samples (x = 9). Still, a lot of diversity if present judging by the high number of MergedClusters with no sufficient read depth for some samples. **SMAPs per MergedCluster** .. image:: ../images/tutorial/00_Coffea_canephora/02_cultivated_coffee/00_SMAP_delineate/final_stack_positions.MergedCluster.SMAP.histogram.png The abundance of read mapping polymorphisms is depicted in the SMAPs per MergedCluster barplot. All MergedClusters where reads have the same read mapping start and end positions contain 2 SMAPs. InDels and SNPs cause an increased number of SMAPs per MergedCluster. Around 7000 MergedClusters display read mapping polymorphisms in the samples with GBS data. .. tab:: Saturation scatterplot .. image:: ../images/tutorial/00_Coffea_canephora/02_cultivated_coffee/00_SMAP_delineate/StackCluster.Saturation.scatter.png The detection of MergedClusters in function of the number of mapped reads is depicted in the saturation scatterplot. Samples are sequenced at a sufficient depth when increasing the depth yields no substantial increase in detected MergedClusters (= saturation point). It is difficult to say whether the samples with the highest number of mapped reads have reached the saturation point. More GBS samples should be included in the experiment to determine this. However, the three samples with the lowest number of mapped reads are most likely not sequenced at a sufficient depth. .. tab:: QualiMap Qualimap `Multi-sample BAM QC `_ can be executed on all kinds of sequencing data. In addition, a bed or gff file with regions of interest can be provided to the tool which greatly reduces the execution time. A text file containing one column with sample names and a second column with the path to the respective bam file has to be provided. .. code-block:: ruby cd qualimap_v2.2.1/ ./qualimap multi-bamqc -d *file_with_file_names/paths* -r -gff *bed/gff_file_with_regions* -outdir *path_to_output* Multi-sample BAM QC outputs summary statistics, evaluation graphs, as well as sample specific evaluations. We will only focus on summary statistics in the tabs below. * :download:`Example of a full report <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/00_SMAP_delineate/multisampleBamQcReport.html>`. .. tabs:: .. tab:: HiPlex .. raw:: html

Globals

Number of samples	353
Total number of mapped reads	104,638,300
Mean samples coverage	2,618.76
Mean samples GC-content	41.62
Mean samples mapping quality	59.6

The summary statistics give a quick overview of the overall mapping quality of the samples. Check the samples statistics for sample-by-sample mapping quality scores. Mapping quality is a phred-scaled probability of the read alignment being correct. For example, scores of 30 and 60 imply a read has a probability of 99.9% and 99.9999%, respectively, of being mapped correctly. Sample coverage refers to the average number of times each base in the reference genome sequence is covered by reads. A coverage of 30 is considered the lower limit for many applications. HiPlex performs very well for sample coverage. This was expected because the sequencing method specifically targets these regions. The mapping quality is also very good. .. tab:: GBS .. raw:: html

Summary

Globals

Number of samples	9
Total number of mapped reads	8,784,773
Mean samples coverage	35.14
Mean samples GC-content	40.31
Mean samples mapping quality	59.66

GBS samples have a much lower coverage than HiPlex but still pass the threshold of 30. Mapping quality is similar to HiPlex. .. tab:: WGS .. raw:: html

Summary

Globals

Number of samples	2
Total number of mapped reads	487,563,633
Mean samples coverage	67.19
Mean samples GC-content	40.09
Mean samples mapping quality	59.58

The two WGS samples have double the coverage compared to GBS, but it is still a lot lower than HiPlex. The mapping quality is again more than adequate. .. _SNP: Calling SNPs ============ Next, BCFtools and GATK were compared for SNP calling. Polymorphisms were identified, SNPs were selected and filtered on position and quality. .. tabs:: .. tab:: GATK **SNP calling** The Unified Genotyper tool of GATK was used to perform the SNP calling by comparing the mapped reads to the `reference genome `_ of *C. canephora*. We provided a bed file containing the HiPlex loci regions because we are only interested in the SNPs inside them. This region-specific SNP calling greatly reduces computation time. The reference genome bed file has to be included if you need to call SNPs in the entire genome. A lot more resources are required for complete SNP calling but it can be worthwhile if the vcf file will be used in more pipelines. .. code-block:: ruby docker run -w *working_directory_in_container* -v *path_in_host_to_link*:*path_in_container_to_link* broadinstitute/gatk3:3.7-0 java -jar /usr/GenomeAnalysisTK.jar \ -R *path_to_reference_genome_fasta_file* \ -T UnifiedGenotyper -I *file_with_bam_files* \ -o *path_to_vcf_file* \ -L *path_to_bed_file* The command also needs a **.list** file that contains all bam file names (\\n separated). * :download:`Example terminal output (for subset of 10 samples) <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/01_SNP/GATK_output.txt>` * :download:`Example output vcf file (first 1000 lines) <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/01_SNP/GATK.vcf>` * :download:`Example bed file <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/01_SNP/ref_genome_Ccanephora_WF.bed>` The following command was used to determine the number of SNPs: .. code-block:: console grep -v ^# *path_to_vcf_file* | wc -l output: 12130453 **SNP filtering** *Amplicon region filter* After complete calling it is still possible to filter for SNPs inside the specified amplicon regions with following command: .. code-block:: ruby vcftools --vcf *path_to_vcf_file* --bed *path_to_bed_file* --recode --recode-INFO-all --out *path_to_output_1* output: After filtering, kept 364 out of 364 Individuals Outputting VCF file... After filtering, kept 1230 out of a possible 12130453 Sites Run Time = 201.00 seconds VCFtools was used to only retain SNPs inside the amplicon regions. It is possible the following warning is printed in the terminal: .. code-block:: ruby Example warning: Warning: Expected at least 2 parts in FORMAT entry: ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification"> VCFtools does not expect commas in the description tag of the header lines. This warning did not have an influence on our input but if so, the warning can be prevented by removing/changing all commas in the description. *SNP/sample quality filter* .. code-block:: ruby vcftools --vcf *path_to_output_1* --min-meanDP 30 --mac 4 --minQ 20 --minGQ 30.0 --minDP 10.0 --recode --out *path_to_output_2* Output: After filtering, kept 364 out of 364 Individuals Outputting VCF file... After filtering, kept 415 out of a possible 1230 Sites Run Time = 0.00 seconds This command filtered on: * Minimum average depth per sample of 30 * Minimum allele count of 4 * Minimum SNP quality of 20 * Minimum genotype quality of 30 * Minimum depth per sample of 10 *Polymorphic SNP filtering* .. code-block:: ruby docker run -v *path_in_host_to_link*:*path_in_container_to_link* broadinstitute/gatk3:3.7-0 java -jar /usr/GenomeAnalysisTK.jar \ -T SelectVariants \ -R *path_to_reference_genome_fasta_file* \ -V *path_to_output_2* \ --excludeNonVariants \ -o *path_to_output_3* * :download:`Example terminal output <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/01_SNP/GATK_polymorph_output.txt>`. The SelectVariants tool of GATK was used to retain polymorphic SNPs. The command does not output information about the filtering in the terminal. Check SNP number: .. code-block:: console grep -v ^# *path_to_output_3* | wc -l output: 415 Previous filtering apparently already removed non-polymorphic sites. *Biallelic SNP filtering* .. code-block:: ruby docker run -v *path_in_host_to_link*:*path_in_container_to_link* broadinstitute/gatk3:3.7-0 java -jar /usr/GenomeAnalysisTK.jar \ -T SelectVariants \ -R *path_to_reference_genome_fasta_file* \ -V *path_to_output_3* \ -restrictAllelesTo BIALLELIC \ -o *path_to_output_4* :download:`Example terminal output (for subset of 10 samples) <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/01_SNP/GATK_biallelic_output.txt>`. Next, the SelectVariants tool of GATK was used to only retain biallelic SNPs. Check SNP number: .. code-block:: console grep -v ^# *path_to_output_4* | wc -l output: 402 The result of the SNP calling and filtering with GATK/VCFtools was a total of 402 SNPs. This is be compared with BCFtools in the next tab. .. tab:: BCFtools **SNP calling** The second used SNP calling tool was BCFtools. .. code-block:: ruby bcftools mpileup -R *path_to_bed_file* -Ou -a FORMAT/AD,FORMAT/DP -f *path_to_reference_genome_fasta_file* --bam-list *path_to_list_bam_files* \ | bcftools call --skip-variants indels -mv -Ov -a GQ -o *path_to_vcf_file* First, genotype likelihoods at each covered genomic position were created with bcftools mpileup. The number of high-quality bases (DP) and allelic depth (AD) that will later be used to filter was added to the format column. Optionally, the mpileup analysis can be reduced to regions present in a bed file with the -R option. mpileup needs a **.txt** file that contains all bam file names (\\n separated). Next, BCFtools call performs the actual SNP calling. The option -a GQ was provided to include the genotype quality for filtering and InDels were skipped with the option --skip-variants indels. * :download:`Example vcf file (first 1000 lines) <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/01_SNP/bcf.vcf>` BCFtools does not give any terminal output regarding information about the execution time. This must be determined ourselves. **SNP filtering** The same filtering steps are performed for BCFtools as for GATK. *Amplicon region filter* After complete calling it is still possible to filter for SNPs inside the specified amplicon regions with the following command: .. code-block:: ruby vcftools --vcf *path_to_vcf_file* --bed *path_to_bed_file* --recode --recode-INFO-all --out *path_to_output_1* output: After filtering, kept 364 out of 364 Individuals Outputting VCF file... Read 96 BED file entries. After filtering, kept 1146 out of a possible 10573329 Sites Run Time = 441.00 seconds *SNP/sample quality filter* .. code-block:: ruby vcftools --vcf *path_to_output_1* --min-meanDP 30 --mac 4 --minQ 20 --minGQ 30.0 --minDP 10.0 --recode --out *path_to_output_2* output: After filtering, kept 364 out of 364 Individuals Outputting VCF file... After filtering, kept 469 out of a possible 1146 Sites Run Time = 0.00 seconds *Polymorphic SNP filtering* .. code-block:: ruby bcftools view -m2 *path_to_output_2* > *path_to_output_3* Only polymorphic sites were retained by setting the minimal number of alleles to 2. .. code-block:: console grep -v ^# *path_to_output_3* | wc -l output: 469 Previous filtering already removed non-polymorphic sites. *Biallelic SNP filtering* .. code-block:: ruby bcftools view -M2 *path_to_output_3* > *path_to_output_4* Only biallelic sites were retained by setting the maximum number of alleles to 2. .. code-block:: console grep -v ^# *path_to_output_4* | wc -l output: 439 We determined whether the GATK and BCFtools combined with filtering found a common set of SNPs or unique SNPs using a :download:`python script <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/01_SNP/compare_SNPs.py>`. .. code-block:: ruby python compare_SNPs.py --file1 *path_to_vcf_file1* --file2 *path_to_vcf_file2* --output *path_to_output* Example output: There are a total of 372 matching SNPs in files Coffea_GATK.vcf and Coffea_bcf.vcf There are 30 unique SNPs in Coffea_GATK.vcf There are 67 unique SNPs in Coffea_bcf.vcf Unique and shared SNPs are written in the file "SNP_comparison.txt." * :download:`Example ouput file <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/01_SNP/SNP_comparison.txt>` Most of the SNPs were found by both tools. The vcf files contain 372 shared SNPs. GATK identified 30 unique SNPs and BCFtools identified 67 unique SNPs. .. tab:: Peformance GATK vs BCFtools .. image:: ../images/tutorial/00_Coffea_canephora/02_cultivated_coffee/04_SNP/GATK_vs_bcf.gif :align: center .. list-table:: GATK vs BCFtools: performance and SNP numbers :widths: 25 25 25 25 25 :header-rows: 1 :stub-columns: 1 * - - GATK complete - GATK specific - BCFtools complete - BCFtools specific * - Execution time - 7h 6' - 29' - 7h 29' - 38' * - File size (MB) - 20 601.415 - 13.627 - 69 983.156 - 12.187 * - Initial SNPs - 12 130 453 - 1 238 - 10 573 329 - 1 158 * - Amplicon region - 1 230 - / - 1 146 - / * - SNP quality - 415 - 417 - 469 - 474 * - Polymorphic - 415 - 417 - 469 - 474 * - Biallelic - 402 - 402 - 439 - 444 There was a difference in SNP number between the complete and specific SNP calling. VCFtools was used to filter on SNPs within the HiPlex regions when SNPs on the entire genomic sequence were called (Complete calling). In contrast, BCFtools or GATK filtered on these regions when providing the HiPlex bed file for region-specific SNP calling, explaining the slightly differing results. The much larger file size for BCFtools can be explained by the different way BCFtools depicts a sample does not contain a SNP: ./.:0,0,0 GATK depicts this in a much simpler way which decreases the file size: ./. .. _PCA: PCA === A principal component analysis (PCA) was performed in RStudio on the filtered SNPs to quantify the genetic distances between samples of different locations and species. SNPs from both GATK and BCFtools were used and their results were compared. An annotation file is needed to annotate the samples with more information, such as sampling location, species information, sample ID, etc. * :download:`Example annotation file <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/02_PCA/annotation_file.txt>` This :download:`renv lock file <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/02_PCA/renv.lock>` in combination with the renv package can be used to recreate our exact RStudio environment. Put the renv lock file in a directory where you have permissions and where no other lock file is present. Skip the renv code block in the script below if you want to use the latest package versions. .. tabs:: .. tab:: PCA code * :download:`Download code <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/02_PCA/PCA_code.html>` .. image:: ../images/tutorial/00_Coffea_canephora/02_cultivated_coffee/01_PCA/PCA_code.png The results from both GATK and BCFtools are shown below. The scatterplots were put on the same scale for comparison. .. tabs:: .. tab:: GATK .. image:: ../images/tutorial/00_Coffea_canephora/02_cultivated_coffee/01_PCA/GATK.png *C. canephora* samples are separated from the rest of the coffee species on the PC1-axis. The *C. canephora* individuals are separated into groups of wild and cultivated origin on the PC2-axis. A couple of individuals, thought to be *C. canephora*, are placed among the other species. Conversely, an individual labeled as *Coffea congensis* was grouped together with *C. canephora*. .. tab:: BCFtools .. image:: ../images/tutorial/00_Coffea_canephora/02_cultivated_coffee/01_PCA/bcf.png SNPs identified by BCFtools lead to a better separation of *C. canephora* and other species, as well as clearer differentiation between groups of wild and cultivated *C. canephora*. Additionally, the PC2-axis explains more variance compared to the SNP set obtained with GATK. Only the SNPs obtained with BCFtools will be used in the downstream analyses. .. _fastStruc: fastSTRUCTURE ============= Next, the genetic composition of the samples was determined using fastSTRUCTURE, which requires a PLINK formatted genotype call file as input. .. code-block:: ruby vcftools --vcf *path_to_filtered_vcf_file* --plink --out *path_to_output1* First, the filtered vcf file was converted to the PLINK format and .map and .ped files were created. Do not add any suffix to the output file name, this will be added automatically. We got some warnings after running the command: .. code-block:: ruby example warning: Unrecognized values used for CHROM: Cc_chr1 - Replacing with 0. The resulting .map file should contain 4 columns (without headers): Chromosome name, SNP name, Position in morgans or centimorgans (optional), Base-pair coordinates. .. csv-table:: Example .map :file: ../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/03_Faststructure/Unedited.map :widths: 25, 25, 25, 25 :header-rows: 0 * :download:`Full .map file <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/03_Faststructure/Coffea.map>` However, the first column of our .map file only contained zeros because VCFtools does not recognize the chromosome notation in the vcf file. Therefore, the chromosome number was added to the .map file with a :download:`python script <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/03_Faststructure/correct_map_ped.py>`. .. code-block:: ruby python correct_map_ped.py --map *path_to_map_file* --mapout *path_to_output_map_file* output: Used parameters for the script: Input map file = Coffea.map Output map file = Coffea_edited.map Input ped file = None Output ped file = None Annotation file = None Sample ID column = Sample_ID Population ID column = Population_ID Format of the converted .map file: .. csv-table:: Edited .map :file: ../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/03_Faststructure/Edited.map :widths: 25, 25, 25, 25 :header-rows: 0 * :download:`Full edited .map file <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/03_Faststructure/Coffea_edited.map>` The .ped file has the following columns: Family ID, Individual ID, Father ID, Mother ID, Sex, Phenotype, Rest of the columns: Genotypes for each SNP. .. csv-table:: Example .ped :file: ../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/03_Faststructure/Unedited.ped :widths: 25, 25, 10, 10, 10, 10, 10, 10 :header-rows: 0 * :download:`Full .ped file <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/03_Faststructure/Coffea.ped>` However, in our case the first 6 columns of the .ped file only had one distinct value: the family ID and Individual ID are the same, and the other 4 columns contain zeros. The same python script was used to add a family ID to the .ped file which is required for fastSTRUCTURE. An annotation file is needed for the conversion: :download:`example <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/03_Faststructure/Annotations.txt>` .. code-block:: ruby python correct_map_ped.py --ped *path_to_ped_file* --pedout *path_to_output_ped_file* --annot *path_to_annotation_file* --pop_ID *column_name_of_population_ID_in_annotation_file* --sample_ID *column_name_of_sample_ID_in_annotation_file* output: Used parameters for the script: Input map file = None Output map file = None Input ped file = Coffea.ped Output ped file = Coffea_edited.ped Annotation file = Annotations.txt Sample ID column = Sample_ID Population ID column = Collection Format of the converted .ped file: .. csv-table:: Edited .ped :file: ../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/03_Faststructure/Edited.ped :widths: 25, 25, 10, 10, 10, 10, 10, 10 :header-rows: 0 * :download:`Full edited .ped file <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/03_Faststructure/Coffea_edited.ped>` Note that the python script can perform both conversions simultaneously. .. code-block:: ruby plink --file *path_to_output1(no_suffix)* --make-bed --out *path_to_output2(no_suffix)* Next, the converted .ped and .map files were used to create .bed and .map files using PLINK. Again, do not add any suffix to both the input and output file. Example output files: * :download:`Example .bed file <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/03_Faststructure/Coffea.bed>` * :download:`Example .fam file <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/03_Faststructure/Coffea.fam>` It is normal that the outputted .bed file is in a compressed format. Finally, fastSTRUCTURE was run: .. code-block:: ruby conda activate faststructure For i in `seq 2 9`; do structure.py --input= *path_to_output2(no_suffix)* --output= *path_to_output3(no_suffix)* -K $i --full --seed=100; done The tool requires the user to specify the number of clusters (-K option). fastSTRUCTURE was executed with multiple -K values because we did not know the optimal cluster number for our data. The output was examined afterwards to determine the best -K value. The tool should output 5 files per -K. Examples for K = 4: * :download:`.log <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/03_Faststructure/Coffea_K4.log>` * :download:`.meanQ <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/03_Faststructure/Coffea_K4.meanQ>`: mean ancestry proportion per sample * :download:`.meanP <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/03_Faststructure/Coffea_K4.meanP>`: mean error rate of proportion per sample * :download:`.varQ <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/03_Faststructure/Coffea_K4.varQ>`: variance of mean ancestry proportion per sample * :download:`.varP <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/03_Faststructure/Coffea_K4.varP>`: variance of mean error rate of proportion per sample .. code-block:: ruby conda activate faststructure chooseK.py --input *path_to_output3(no_suffix)* Output: Model complexity that maximizes marginal likelihood = 4 Model components used to explain structure in data = 4 The chooseK.py script was used to get the optimal number of clusters which is printed in the terminal. The fastSTRUCTURE results belonging to the optimal K-value were visualized with a :download:`R script <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/03_Faststructure/R_plot_faststructure.R>`. .. image:: ../images/tutorial/00_Coffea_canephora/02_cultivated_coffee/03_faststructure/faststructure_plot.png The barplot depicts the population structure of each sample location where the colors define the different genetic resources: green (Congolese subgroup A), red (Lula cultivars), blue (wild) and purple (other *Coffea* spec.). All four genetic resources were present in the collection where most of the samples were assigned to the Lula cultivars (red). Some sample from the coffee collection showed an admixed composition of Lula-Wild or Lula-Congolese subgroup A. .. _Haplotype: SMAP haplotype-sites ==================== Multi-allelic haplotypes in the amplicon regions of the coffee samples were determined with `SMAP haplotype-sites `_ based on the filtered SNPs. The output is a table containing discrete haplotype calls for all the samples. .. code-block:: ruby smap haplotype-sites *path_to_data* *path_to_bed_file* *path_to_filtered_vcf_file* -mapping_orientation ignore -partial exclude --no_indels --min_read_count 10 --min_haplotype_frequency 2 -p 8 -j 2 -t png -o *path_to_output* -e dosage -i 10 10 90 90 -z 2 * :download:`Example terminal output <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/07_Haplo/output_haplo.txt>`. The program requires three input arguments: * Location of the bam files to be analysed * Location of the bed file containing the amplicon regions * Location of the filtered vcf file Mapping orientation was set to ignore because our reads are merged before mapping them to the reference genome. Use stranded orientation for GBS single-end or paired-end reads that are mapped separately. Only reads that completely cover the locus, including both start and end points, were considered for haplotyping with -partial exclude. -j and -z were set to 2 because we expect an allele count of 2 per locus per individual for a diploid organism. -e was set to dosage to work with allele copy numbers instead of allele frequencies. The -i option, specifying the frequency interval bounds, was set to the default for diploid species (See `options for discrete calling in individual samples `_). * :download:`Example discrete filtered haplotype call file <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/07_Haplo/Coffea_filtered_discrete_haplotypes.tsv>` .. _grm: SMAP grm ======== The pairwise genetic similarities between all the samples was calculated and plotted using `SMAP grm `_. We used a :download:`python script adapted from SMAP grm <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/04_GRM/SMAP_grm.py>` because the command line tool has some issues with plotting the similarity heatmap. The python script was first executed to get a genetic distance matrix file which was then used to define the sample plotting order by clustering similar samples together. .. code-block:: ruby python SMAP_grm.py --table *path_to_filtered_discrete_tsv_file* -p 32 --print_sample_information Matrix \ --similarity_coefficient Jaccard -lc 0.1 --distance \--distance ensured a distance matrix is outputted which was needed to create a phylogenetic tree later on. The lc option was set to 0.1 to filter loci present in less than 10% of the samples. Jaccard was chosen as the similarity coefficient because it is appropriate for discrete dosage calls. Jaccard similarity coefficient: a / (a + b + c) * :download:`Example distance matrix file <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/04_GRM/Coffea_distances_Jaccard.dist>` A text file can be provided to SMAP grm containing 1 mandatory and 2 optional columns. The first column contains the sample plotting order (mandatory). The second column is used to convert the sample names to new ones (optional). The third column contains group IDs to color the sample labels by (optional). Note that an empty second column is still needed if you only want to color the labels by group and not change the sample name! * :download:`Example file to order and color samples in heatmap <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/04_GRM/Coffea_sample_order_color.txt>` The distance matrix file was loaded in RStudio to determine the best sample plotting order for the heatmaps. We created two plotting orders in RStudio: one where only the pairwise distances are used to cluster samples together (Global clustering), and one where samples are first grouped by location and then clustered by pairwise distances to determine the structure of each sample location (Location clustering). The resulting sample orders were exported to a text file, with one sample per line. The Rscript can be downloaded :download:`here <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/04_GRM/matrix_to_sorted_heatmap.R>`. Different clustering techniques are used in the script to order the samples: * Single Linkage: * uses the minimum distance between any two points in different clusters to merge them * fast and can perform well on non-globular data * performs poorly in the presence of noise * Complete Linkage: * uses the maximum distance between any two points in different clusters to merge them * performs well on cleanly separated globular clusters * Median Linkage: * uses the median distance between all pairs of points in different clusters to merge them * compromises between single and complete linkage * Average Linkage: * computes the average distance between all pairs of points in different clusters to merge them * performs well on cleanly separated globular clusters * Centroid Linkage: * merges clusters by computing the distance between the centroids of different clusters * useful when dealing with spherical-shaped clusters * Ward's Method: * minimizes the total within-cluster variance when merging clusters * well-suited for datasets with a moderate to large number of observations * most effective method for noisy data * McQuitty's Method * uses the average distances between centroids of different clusters to merge them * suitable when dealing with large datasets We chose Ward's method (ward.D2) because we don't know the amount of noise in the data and it is widely used. .. code-block:: ruby python SMAP_grm.py --table *path_to_filtered_discrete_tsv_file* -n *path_to_sample_order_file* \ -p 32 --similarity_coefficient Jaccard -lc 0.1 \ --print_sample_information Plot --plot_format png --title_fontsize 100 \ --label_fontsize 18 --tick_fontsize 6 --legend_fontsize 100 --colour_map Spectral The heatmap plots the pairwise similarity scores between all samples. These similarity values can be added in each cell of the heatmap with the option \--annotate_matrix_plots, but this significantly increases the file size. The different font sizes might have to be increased or decreased depending on the size of your dataset. The `color map `_ can of course also be changed depending on your taste. Optionally, the dpi can be lowered for very large datasets with -r. The heatmaps sorted by location clustering and global clustering are shown in the tabs below. .. tabs:: .. tab:: Location clustering .. image:: ../images/tutorial/00_Coffea_canephora/02_cultivated_coffee/05_GRM/Jaccard_genetic_similarity_heatmap_grouped.png (To better see the sample_IDs, download the :download:`full image <../images/tutorial/00_Coffea_canephora/02_cultivated_coffee/05_GRM/Jaccard_genetic_similarity_heatmap_grouped.png>`) The population order is retained in this heatmap to determine the structure of each sample location. The sample labels have been colored according to their location. The Yangambi collection has 3 subgroups with high in-between similarity. Additionally, one sample shows severe dissimilarity across all samples except for two *Psilanthus manni* samples. The reference samples of the Congolese subgroup A only shows high similarity with each other. The Wild samples are by far the most related to each other than to other samples the collection or Congolese subgroup A. Finally, there are samples from the Yangambi collection that are more similar to some *Coffea* spec. samples indicating those collection samples are not *C. canephora*. Location clustering also serves as a confirmation of the fastSTRUCTURE results. For instance, there was a large presence of the blue cluster in the Yangambi collection, belonging to the Lula-Wild samples. In the heatmap, we can also see a subgroup in the Yangambi collection with larger similarities to the Wild samples than to the rest of its own group. On the other hand, it is harder to determine the genetic identity of the samples. This will be easier to spot in the heatmap with global clustering. .. tab:: Global clustering .. image:: ../images/tutorial/00_Coffea_canephora/02_cultivated_coffee/05_GRM/Jaccard_genetic_similarity_heatmap_global.png (To better see the sample_IDs, download the :download:`full image <../images/tutorial/00_Coffea_canephora/02_cultivated_coffee/05_GRM/Jaccard_genetic_similarity_heatmap_global.png>`) The sample locations were not taken into account in the creation of this heatmap. As a result, samples were only clustered by their similarity scores. Most of the samples are still placed in their original sample locations. The majority of the individuals from Unknown origin are inside the Yangambi collection which is expected for those that were thought to be *C. canephora*. Three of them: two samples labelled as 'Petit Kwilu' cultivars and one labelled *Coffea congensis* are also in this cluster. Four other members of 'Unknown origin' are clustered with the '*Coffea* spec.' samples, two of those were thought to be *C. canephora* and probably incorrectly labelled. The other two are in the correct cluster since they were labelled as *P. manni*. Still, they show no close relation to the *P. manni* samples of '*Coffea* spec.'. .. _phylo: Phylogenetic tree ================= Next, a phylogenetic tree was created in RStudio to visually represent both the genetic distances between our samples and the presence of different genetic resources in each sample location. The `Neighbour Joining `_ algorithm was used because it can be directly applied to the distance matrix created with SMAP grm. Moreover, branch lengths separating the samples are determined (unlike, for instance, UPGMA) and the tree can be easily rooted using an appropriate outgroup. An annotation file was used to color the tree branches by the sample locations. Our package versions can again be obtained with a :download:`renv lock file <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/06_Phylo/renv.lock>`. * :download:`Example annotation file <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/06_Phylo/Annotations.txt>` .. tabs:: .. tab:: Phylogenetic tree code * :download:`Download code <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/06_Phylo/Phylo_code.html>` .. image:: ../images/tutorial/00_Coffea_canephora/02_cultivated_coffee/07_Phylo/Phylo_code.png .. tab:: Phylogenetic tree plot .. image:: ../images/tutorial/00_Coffea_canephora/02_cultivated_coffee/07_Phylo/phyl_tree_NJ.png The phylogenetic tree shows clustering of samples of the same genetic resource. *Coffea* spec. individuals have longer branch lengths compared to the neighbouring wild and Yangambi samples, indicating they are less genetically similar. Clustering of some Yangambi samples together with samples collected from the natural forest in Yangambi is consistent with both SMAP grm and the PCA. .. tab:: Evaluation of tree .. image:: ../images/tutorial/00_Coffea_canephora/02_cultivated_coffee/07_Phylo/NJ_control.png This evaluation graph plots the pairwise distances from the distance matrix from SMAP grm. The Neighbour Joining algorithm is able to adequately represent the distances from the input matrix judging by the high R²-score and position of most of the data inside the 95% prediction interval. .. _core: Core collection --------------- Establishing a core collection is a great way to ensure future sustainable and effective conservation management and use of any collection. Two strategies to propose core collections were tested using the `corehunter `_ RStudio package: * Maximization strategy (CC-I) * maximizes the genetic diversity by selecting the most diverse loci * Genetic distance method (CC-X) * maximizes the genetic distance between the entries of the core collection We were interested in creating core collections for the *C. canephora* samples of the Yangambi collection. Therefore, we deleted all non Yangambi samples and all non *C. canephora* samples with VCFtools, including the likely wrongly labelled *C. canephora* sample from the Yangambi collection. A file containing samples to be retained has to be provided to the command. .. code-block:: ruby vcftools --vcf *path_to_vcf_file* --keep *file_with_desired_samples* --recode --out *path_to_output* Next, the filtered vcf file from BCFtools was converted to the proper format for corehunter. In short, the GT column was simplified and all SNPs and samples were assigned an ID. The IDs before and after conversion were added to a cross reference file. Additionally, any SNPs that had no GT calls (0/0 or ./.) for any samples were removed and put in a text file for control. Conversion from vcf to core hunter file was performed with a custom :download:`python script <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/05_Core/convert_vcf_to_core.py>`. .. code-block:: ruby python convert_vcf_to_core.py --input *path_to_(subsetted)_vcf_file* --core *path_to_core_file* --cross_ref *path_to_cross_ref_file* * :download:`Example core file <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/05_Core/Coffea_Core_hunter.txt>` * :download:`Example cross reference file <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/05_Core/Coffea_Core_hunter_cross.txt>` Again, a :download:`renv lock file <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/05_Core/renv.lock>` can be used to recreate our exact RStudio environment. Be sure to put the lock file in a directory where you have permissions and where no other renv lock file is present. .. tabs:: .. tab:: Core collection code * :download:`Download code <../../download_files/tutorial/Coffea_canephora/02_cultivated_coffee/05_Core/Core_collector.html>` .. image:: ../images/tutorial/00_Coffea_canephora/02_cultivated_coffee/06_Core/Core_collector.png .. tab:: Evaluation plots .. image:: ../images/tutorial/00_Coffea_canephora/02_cultivated_coffee/06_Core/CC-I.png We chose the collection size corresponding to the local Shannon diversity and expected heterozygosity peak around size 40. We determined this peak is present at size 42 by looking at the values in the data frame. .. image:: ../images/tutorial/00_Coffea_canephora/02_cultivated_coffee/06_Core/CC-X.png The CC-X collection size should be smaller than CC-I because CC-X contains collection extremes. We set a minimal allele coverage threshold of 95% for the collection because we also don't want to lose too much genetic information. The smallest collection size that meets this criterion is 28. .. image:: ../images/tutorial/00_Coffea_canephora/02_cultivated_coffee/06_Core/Core_PCA_no_annot.png Using the sampleCore function of Core Hunter we selected the most suitable samples for each core collection strategy and visualized them in the PCA scatterplot. The optimal core size for both core collection strategies is also stated in the legend. The samples assigned to both CC-I and CC-X core collection were evenly distributed across the ordination space of the Yangambi coffee collection. Only 3 samples are present in both CC-I and CC-X collections. #################################################### Population genetics of bullfrog, an invasive species #################################################### This study involves several hunderd diploid individuals, sampled from 38 locations. Each individual was genotyped using double-digest GBS (PstI-EcoRI). These populations were sampled in Belgium, where bullfrog is an invasive species. The purpose of the study is to analyse population structure and relate this to geographical spread of the species. GBprocesS --------- Demultiplexing was done using GBprocesS. See instruction for download and installation of GBprocesS. This is the config (.ini) file used for demultiplexing, trimming, merging forward and reverse reads, base quality filtering, and finaly removal of reads with internal restriction sites. Note that library preparation for such very large sample sets is usually done in several batches, whereby each batch reuses the sample-specific barcodes in the barcode adapter, and each batch is amplified with different sets of i5 and i7 Illumina barcodes during the PCR that adds the flowcell binding sites. In case some of the batch does not use all barcodes, it is important to include a "dummy" barcode in the demultiplexing (with non-existing sequence and the maximum length barcode used across the sample set). This is to make sure that all reads of each sample are trimmed to the same maximal length in the demultiplexing step. This is an example for a barcode file for demultiplexing of incomplete plates. For further explanation see the manual of GBprocesS. Mapping reads with BWA-MEM -------------------------- After demultiplexing and merging reads, they can be directly mapped onto a reference genome sequence. Here, we use the publicly available reference genome of bullfrog. Prepare the reference genome and index it:: BWA Map the reads of all samples onto the reference genome:: BWA mem SMAP delineate -------------- A first quality control of any GBS experiment is to run **SMAP delineate** on the BAM files of the entire sample set. SMAP delineate:: smap delineate $download_dir/Real_data/SMAP_delineate/input/ind/ -mapping_orientation ignore --processes 8 --plot all --plot_type pdf --min_stack_depth 2 --max_stack_depth 1500 --min_cluster_length 50 --max_cluster_length 300 --max_stack_number 20 --min_stack_depth_fraction 10 --min_cluster_depth 10 --max_cluster_depth 1500 --max_smap_number 20 --name 48_ind_GBS-PE SMAP delineate will provide an overview of the read mapping quality, locus length distribution, locus read depth distribution, the number of loci per sample (saturation curve) and locus completeness across the sample set. .. image:: ../images/tutorial/bullfrog/Bullfrog_saturation_curve_557_samples_RD15_C5.png In this case (as shown by the saturation curve), the normalisation of the sample set was not optimal. Almost none of the samples reached locus saturation, with more than 100.000 loci detected for the samples with the largest library size (5-6M reads mapped). This can be explained by the large genome size of the species (about 3 Gb). Most of the samples are positioned on the steep part of the saturation curve (2-3M reads, 40-60k loci), meaning they have insuffcient reads to cover all possible loci with minimal read depth (here RD 15), and sequencing deeper would have yielded more loci per sample. This relatively high level of missing data (due to technical reasons) affects the possibility to compare samples by "common loci". In addition, some of the samples have nearly no reads (100-500k reads) and a very minimal number of detected loci and are 'dropout' samples. The question, then, is how to set a threshold for minimal library size to retain good quality samples? Effect of library size on genetic relationship estimations ---------------------------------------------------------- To estimate the number of reads required per sample to estimate stable genomic fingerprints and to quantitatively estimate population genetics parameters, we first investigate how library size affects estimation of genetic relatedness. For this, we chose 14 samples with very high library size (4-6M merged reads after demultiplexing) representing different sampling locations. we quickly checked (with SMAP haplotype-window and SMAP grm) that these samples indeed represent the genetic diversity (range 0.7-1.0) across the sample set (sneak preview of the rest of the analysis). .. image:: ../images/tutorial/bullfrog/14_comp_subsampling_GRM.png We then created 'saturation' curves that show how many reads are required per sample to yield a given number of loci. Furthermore, we analysed how a reduced number of loci affects pairwise genetic relationship estimation (using SMAP delineate, SNP calling with SAMtools, SMAP haplotype-sites and finally SMAP grm). The purpose of this step is to estimate the minimum number of reads required per sample, and conversely, to select samples with at least that number of reads for the rest of the analysis. Other samples with lower number of reads are excluded, as they have too little data to yield good quality genetic fingerprints. We created a series of simulated incomplete samples by computationally subsampling the .fastq files:: head -n 400000 01_21_01.fq > 01_21_01_0.10.fq head -n 800000 01_21_01.fq > 01_21_01_0.20.fq head -n 1200000 01_21_01.fq > 01_21_01_0.30.fq head -n 1600000 01_21_01.fq > 01_21_01_0.40.fq head -n 2000000 01_21_01.fq > 01_21_01_0.50.fq head -n 3000000 01_21_01.fq > 01_21_01_0.75.fq head -n 4000000 01_21_01.fq > 01_21_01_1.00.fq head -n 5000000 01_21_01.fq > 01_21_01_1.25.fq head -n 6000000 01_21_01.fq > 01_21_01_1.50.fq head -n 7000000 01_21_01.fq > 01_21_01_1.75.fq head -n 8000000 01_21_01.fq > 01_21_01_2.00.fq head -n 9000000 01_21_01.fq > 01_21_01_2.25.fq head -n 10000000 01_21_01.fq > 01_21_01_2.50.fq head -n 11000000 01_21_01.fq > 01_21_01_2.75.fq head -n 12000000 01_21_01.fq > 01_21_01_3.00.fq Here, the name of the files is coded as follows (split by '_'): first code indicates the location, the second code the individual, the third the replicate. numbers at the end of the file indicate the number of reads included after subsampling (in M) to simulate incomplete data. This code is repeated for each of 14 samples (.fastq files). We then ran **SMAP delineate** again to create the corresponding saturation curve:: smap delineate $download_dir/Real_data/SMAP_delineate/input/ind/ -mapping_orientation ignore --processes 8 --plot all --plot_type pdf --min_stack_depth 2 --max_stack_depth 1500 --min_cluster_length 50 --max_cluster_length 300 --max_stack_number 20 --min_stack_depth_fraction 10 --min_cluster_depth 10 --max_cluster_depth 1500 --max_smap_number 20 --name 48_ind_GBS-PE .. image:: ../images/tutorial/bullfrog/Bullfrog_saturation_curve_276_samples_RD10_C5.png This curve indeed shows that the shape of the locus saturation curve observed for the entire sample set can be expained by suboptimal library size of a substantial number of samples. Next, we ran SAMtools to identify SNPs in the 14 samples with complete data:: samtools We used the .bed file with locus coordinates created by **SMAP delineate** to select 5000 loci with 2 SMAPs and completeness across all 14 samples. Running SAMtools on this locus set ensures that there is data in all 14 samples, and that in principle all pairwise comparisons to the "ground truth" can be made (as the ground truthh is based on "complete data"). Samtools identified 4000 SNPs in about 2500 loci. Then, we ran **SMAP haplotype-sites** to create haplotype tables, and ran SMAP grm to estimate genetic relatedness (Jaccard similarity of loci with partially shared alleles):: smap haplotype-sites smap grm .. image:: ../images/tutorial/bullfrog/GRM_similarity_saturation_curves_01_21.png This shows that the estimated genetic relatedness is stable from 1 - 1.25M or more merged reads. With less read data, the estimation of genetic fingerprints starts to deviate from the ground truth (compared to its own sample with saturated data, value expected to be 1.00). Selection of good quality samples --------------------------------- Now that we have estimated that using samples with at least 1M merged reads is required for good quality population genetic analysis, we will only continue with those 455 samples. These are distributed across the 38 locations as follows. .. image:: ../images/tutorial/bullfrog/nr_ind_per_location_1M.png We next run **SMAP delineate** again to estimate the completeness and saturation of this sample set. SMAP delineate:: smap delineate $download_dir/Real_data/SMAP_delineate/input/ind/ -mapping_orientation ignore --processes 8 --plot all --plot_type pdf --min_stack_depth 2 --max_stack_depth 1500 --min_cluster_length 50 --max_cluster_length 300 --max_stack_number 20 --min_stack_depth_fraction 10 --min_cluster_depth 10 --max_cluster_depth 1500 --max_smap_number 20 --name 48_ind_GBS-PE .. image:: ../images/tutorial/bullfrog/Bullfrog_saturation_curve_455_samples_RD10_C5.png Identify SNPs ------------- Instructions to identify and filter SNPs SMAP haplotype-sites -------------------- Create and navigate to a new output directory to run **SMAP haplotype-sites** on the BAM files with mapped GBS reads of a set of **individuals**, its BED file from **SMAP delineate**, and a VCF file with SNP calls (see for third-party SNP calling software: *e.g.* `SAMtools `_, `BEDtools `_, `Freebayes `_, or `GATK `_ for individuals):: mkdir -p $download_dir/Real_data/SMAP_haplotype_sites/output/ind cd $download_dir/Real_data/SMAP_haplotype_sites/output/ind smap haplotype-sites $download_dir/Real_data/SMAP_delineate/input/ind/ $download_dir/Real_data/SMAP_delineate/output/ind/final_stack_positions_48_ind_GBS-PE_C0_SMAP20_CL50_300.bed $download_dir/Real_data/SMAP_haplotype_sites/input/48_ind_GBS-PE.vcf --out haplotypes_48_ind_GBS-PE -mapping_orientation ignore --discrete_calls dosage --frequency_interval_bounds diploid --dosage_filter 2 --plot all --plot_type pdf -partial include --min_distinct_haplotypes 2 --min_read_count 10 --min_haplotype_frequency 5 --processes 8 SMAP grm -------- Create and navigate to a new output directory to run **SMAP grm** on the haplotype call table of a set of **individuals**:: mkdir -p $download_dir/Real_data/SMAP_grm/output/ind cd $download_dir/Real_data/SMAP_grm/output/ind smap grm $download_dir/Real_data/SMAP_delineate/input/ind/ $download_dir/Real_data/SMAP_delineate/output/ind/final_stack_positions_48_ind_GBS-PE_C0_SMAP20_CL50_300.bed $download_dir/Real_data/SMAP_haplotype_sites/input/48_ind_GBS-PE.vcf --out haplotypes_48_ind_GBS-PE -mapping_orientation ignore --discrete_calls dosage --frequency_interval_bounds diploid --dosage_filter 2 --plot all --plot_type pdf -partial include --min_distinct_haplotypes 2 --min_read_count 10 --min_haplotype_frequency 5 --processes 8 Deactivate your virtual environment:: deactivate