
Evaluating probabilistic genotyping for low-pass DNA sequencing

Published: October 06, 2022. DOI: https://doi.org/10.1016/j.fsigss.2022.10.001

      Abstract

      Most genomic methods consider the sample genotype. Data are evaluated at some location, and if the signal strength is sufficient, a genotype call is made. Conversely, sites that lack sufficient signal are treated as missing data. Such methods for genotype calling are binary, and this dichotomy limits genomic analyses to relatively high-coverage (and high-cost) massively parallel sequencing (MPS) data. It follows that bioinformatic methods that rely on genotypes may not be ideal for trace DNA samples, such as those sometimes encountered in forensic investigations, but even when applicable such analyses can be expensive. However, there are some genomic analyses where having many uncertain genotypes (with measured uncertainty) assayed over the entirety of the genome may be more powerful than current multi-locus approaches that consider a limited number of well-characterized markers. Methods for such problems may rely on genotype likelihood, which expresses the likelihood of alternative genotype calls in addition to the most likely call. One application that can benefit from genotype likelihoods is kinship analysis. NgsRelate is a bioinformatic tool that infers pairwise relatedness using a probabilistic genotyping framework, which accommodates the uncertainty associated with genotype calls for low-pass MPS data. Here, NgsRelate was used to infer kinship coefficients from low-pass whole genome sequencing data from a known pedigree. Multiple samples in a titration series (ranging from 50 ng to 0.5 ng) on a single MPS S4 flow cell were assessed. A reproducible scientific bioinformatic workflow was developed to evaluate kinship coefficients considering up to 3rd degree relatives. NgsRelate was found to provide robust assessments of kinship. Further, the use of low-pass MPS data provides a more cost-effective way to conduct forensic investigations.


      1. Introduction

      The use of single nucleotide polymorphisms (SNPs) has emerged as a powerful approach in kinship analysis [Yagasaki et al., 2022]. SNP profiles can be assayed using different technologies, commonly whole-genome sequencing (WGS), targeted sequencing, or microarray-based genotyping. Microarray-based genotyping is accurate, inexpensive, and easy to perform (with less demanding bioinformatic analysis) [Kling et al., 2021]. However, accurately determining the SNP genotype requires ample DNA, an infeasible requirement in many forensic investigations [Yagasaki et al., 2022; Kling et al., 2021]. Targeted sequencing (for example, Verogen’s ForenSeq Kintelligence kit) focuses on a smaller subset of SNPs, and techniques such as hybridization capture lack a large SNP panel applicable to forensic genetics [Kling et al., 2021]. Although WGS may seem expensive, it offers a more refined and advanced workflow than other technologies [Kling et al., 2021]. Additionally, WGS potentially increases the detection power for distant relatives [Li et al., 2014]. The cost of WGS analyses (a limiting factor in most cases) can be lowered, as shown in the current study, by sequencing more than the recommended number of samples in a single run. However, low-pass sequencing data can lead to inaccurate genotype calls [Nielsen et al., 2012]. An alternative approach to inferring pairwise relatedness employs genotype likelihoods rather than genotype calls (see [Korneliussen et al., 2015; Nøhr et al., 2021]). In this study, the genotype-likelihood-based KING-robust estimator [Waples et al., 2019], implemented in NgsRelate v2 [Korneliussen et al., 2015], was used to infer kinship coefficients from low-pass WGS data from a known pedigree. NgsRelate is a bioinformatic tool that infers pairwise relatedness using a probabilistic genotyping framework; it is therefore well suited to low-pass massively parallel sequencing (MPS) data.
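To make the idea of a genotype likelihood concrete, the following is a minimal sketch of how likelihoods for the three genotypes at a biallelic site might be computed from read counts. It assumes a simple symmetric sequencing-error model with a single hypothetical `error_rate` parameter; it is not the model used by BCFtools or NgsRelate, and the function name is illustrative only.

```python
import math


def genotype_likelihoods(n_ref, n_alt, error_rate=0.01):
    """Normalized likelihoods for genotypes carrying 0, 1, or 2 alternate alleles.

    Assumption: each read independently shows the alternate allele with
    probability p_alt, which depends on the genotype and a symmetric
    per-base error rate. The binomial coefficient is omitted because it is
    identical for all three genotypes and cancels on normalization.
    """
    log_liks = []
    for n_alt_alleles in (0, 1, 2):
        frac_alt = n_alt_alleles / 2
        p_alt = frac_alt * (1 - error_rate) + (1 - frac_alt) * error_rate
        log_liks.append(n_alt * math.log(p_alt) + n_ref * math.log(1 - p_alt))
    # Rescale in log space before exponentiating to avoid underflow.
    max_ll = max(log_liks)
    scaled = [math.exp(ll - max_ll) for ll in log_liks]
    total = sum(scaled)
    return [s / total for s in scaled]


# A single alternate read cannot exclude a heterozygote ...
one_read = genotype_likelihoods(n_ref=0, n_alt=1)
# ... whereas ten alternate reads make a homozygous-alternate call much more certain.
ten_reads = genotype_likelihoods(n_ref=0, n_alt=10)
```

Under this toy model, a site covered by one read still assigns about one third of the normalized likelihood to the heterozygous genotype, which illustrates why hard genotype calls are unreliable at low coverage while the likelihoods themselves remain informative.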

      2. Methods

      In this study, WGS was performed on forty samples selected from a known three-generation pedigree. The DNA samples were obtained from the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research. Eight samples (NA13050, NA13047, NA07013, NA07035, NA07028, NA07437, NA07046, NA13053) were serially diluted from 50 ng to 0.5 ng (50 ng, 10 ng, 5 ng, 1 ng, 0.5 ng). The Illumina DNA Prep kit was used to prepare DNA libraries according to the manufacturer’s protocol. Briefly, DNA samples were tagmented and amplified using unique dual-indexed adapters from IDT®. The forty amplified libraries were purified, quantified, normalized, pooled (2 nM), and sequenced on a NovaSeq 6000 (S4 Reagent Kit v1.5, 300 cycles).
      FASTQ files from the NovaSeq were mapped to the GRCh38 reference genome, and sorted and indexed BAM files were created [Li, 2013; Danecek et al., 2021]. Optical and PCR duplicates were marked, indels were realigned, and base-quality scores were recalibrated [McKenna et al., 2010]. Variant calling was performed with BCFtools [Danecek et al., 2021] on pairs of BAM files, generating VCF files with both genotypes and genotype likelihoods. Problematic regions exist in the genome; for example, repetitive regions may be under-represented relative to the actual sequence present [Amemiya et al., 2019]. These regions are likely to produce artifacts in genome assemblies and lead to inaccurate interpretation [Amemiya et al., 2019]. Such regions were identified and removed. VCF files were annotated with non-Finnish European population allele frequencies from gnomAD [Karczewski et al., 2020] and optionally down-sampled to include only well-described sites. For the latter, the union of sites in Illumina’s Infinium Global Screening Array (GSA), a sparse set of ‘reliable’ SNPs useful for kinship analysis [Arthur et al., 2017], and SNPs from Verogen’s ForenSeq Kintelligence system was taken. Collectively, these sites are termed GSA-KR. Kinship coefficients were estimated using either the genotype-based [Manichaikul et al., 2010] or the genotype-likelihood-based [Waples et al., 2019] estimator of the KING-robust algorithm, as implemented in NgsRelate (v2). The root-mean-square error (RMSE) was computed by comparing the observed (inferred) kinship coefficient with the expected kinship coefficient.
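For readers unfamiliar with the genotype-based KING-robust estimator, the sketch below shows one way the between-family form described by Manichaikul et al. could be computed from hard genotype calls. It is an illustration under stated assumptions, not the NgsRelate implementation: genotypes are assumed to be coded as the number of alternate alleles (0, 1, 2), missing calls as `None`, and the function name is hypothetical.

```python
def king_robust_kinship(geno_i, geno_j):
    """KING-robust kinship estimate from two vectors of genotype calls.

    Assumptions: genotypes are alternate-allele counts (0, 1, 2); sites
    where either individual has a missing call (None) are skipped.
    """
    het_i = het_j = het_het = ibs0 = 0
    for gi, gj in zip(geno_i, geno_j):
        if gi is None or gj is None:
            continue
        het_i += int(gi == 1)                  # heterozygous sites in individual i
        het_j += int(gj == 1)                  # heterozygous sites in individual j
        het_het += int(gi == 1 and gj == 1)    # sites where both are heterozygous
        ibs0 += int(abs(gi - gj) == 2)         # opposite homozygotes (no shared allele)
    min_het = min(het_i, het_j)
    if min_het == 0:
        raise ValueError("estimator is undefined without heterozygous sites")
    return (
        (het_het - 2 * ibs0) / (2 * min_het)
        + 0.5
        - 0.25 * (het_i + het_j) / min_het
    )


# Identical genotype vectors (e.g., a duplicate sample) give 0.5.
duplicate = king_robust_kinship([0, 1, 2, 1], [0, 1, 2, 1])
```

The estimator depends only on counts of shared heterozygous sites and opposite homozygotes, so it does not require population allele frequencies. With low-pass data, miscalled genotypes distort these counts directly, which is the motivation for the genotype-likelihood-based variant used in this study.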

      3. Results and discussion

      3.1 WGS throughput

      A total of 15.3 billion reads were generated, of which 12% were optical and/or PCR duplicates. Coverage was estimated using 10,000 random positions from non-duplicate reads. Coverage ranged from 3.3× to 8× across DNA inputs (0.5 ng to 50 ng) and was variable both within and between subjects. For instance, NA07028 had 3.3× and 6× coverage for 1 ng and 50 ng of DNA, respectively, while NA13053 showed roughly the opposite (6.5× and 3.5× for the same DNA inputs). No clear association between the amount of DNA used and read depth was apparent, as may be expected given library normalization [Riman et al., 2020].
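The coverage estimate described above can be illustrated with a small sketch that samples random positions and averages the read depth at each. The source does not specify how the estimate was implemented, so everything here is an assumption made for illustration: reads are represented as half-open `(start, end)` intervals on a single toy region, duplicates are assumed to have been removed already, and the function name is hypothetical. A real workflow would query the BAM files instead.

```python
import random


def mean_depth_at_random_positions(reads, region_length, n_positions=10_000, seed=0):
    """Estimate mean coverage by sampling positions uniformly at random.

    Assumptions: `reads` is a list of half-open (start, end) intervals from
    non-duplicate reads on one region of length `region_length`.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    total_depth = 0
    for _ in range(n_positions):
        pos = rng.randrange(region_length)
        # Depth at a position is the number of reads overlapping it.
        total_depth += sum(1 for start, end in reads if start <= pos < end)
    return total_depth / n_positions


# Toy example: two reads spanning the whole region give a depth of exactly 2.
toy_depth = mean_depth_at_random_positions([(0, 100), (0, 100)], region_length=100)
```

Sampling a fixed number of positions keeps the cost of the estimate independent of genome size, at the price of some sampling noise in the reported coverage.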

      3.2 Kinship estimation

      Kinship coefficients were estimated by comparing all 50 ng samples to all samples (0.5 ng to 50 ng) so as to emulate the forensic use case. Kinship coefficients inferred using genotype likelihoods down-sampled to only GSA-KR sites (herein, GL-GSA-KR) were more accurate (RMSE 0.011) than those inferred from genotypes assessed across the whole genome (herein, genotypes; RMSE 0.116). Using genotypes also led to systematically lower kinship coefficient estimates for all degrees of relatedness, including the unrelated class (Fig. 1). Considering only genotypes (and not genotype likelihoods) at GSA-KR SNPs gave levels of error comparable to the whole-genome SNP assessment (RMSE 0.115), suggesting a benefit to using genotype likelihoods in conjunction with well-characterized SNPs.
      Fig. 1. Inferred kinship coefficients (y-axis and colors) are contrasted against the relationship degree (x-axis and colored horizontal lines). Kinship coefficients were estimated using two approaches (outer rows): genotype calls from WGS, and genotype likelihoods at SNPs pre-selected to be variable in the population (GL-GSA-KR). Kinship coefficients were inferred for serially diluted DNA input from 0.5 ng to 50 ng (outer columns) over a range of expected relationships.
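The RMSE values reported above compare inferred kinship coefficients with their pedigree expectations. A minimal sketch of that comparison follows, assuming the standard expectation for a non-inbred pedigree that relatives of degree d have a kinship coefficient of 1/2^(d+1) (0.25, 0.125, and 0.0625 for 1st, 2nd, and 3rd degree) and that unrelated pairs have 0. The function names and the example values are hypothetical and are not data from this study.

```python
import math


def expected_kinship(degree):
    """Expected kinship coefficient for a relationship degree.

    Assumption: non-inbred pedigree; `None` denotes an unrelated pair.
    """
    return 0.0 if degree is None else 0.5 ** (degree + 1)


def rmse(observed, expected):
    """Root-mean-square error between paired observed and expected values."""
    if len(observed) != len(expected) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(o - e) ** 2 for o, e in zip(observed, expected)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical inferred coefficients for a 1st-, 2nd-, 3rd-degree and unrelated pair.
degrees = [1, 2, 3, None]
inferred = [0.24, 0.13, 0.05, 0.01]
error = rmse(inferred, [expected_kinship(d) for d in degrees])
```

Because RMSE pools squared deviations over all pairs, a systematic downward shift in the estimates, such as the one seen with genotype calls, inflates it even when the ranking of relationships is preserved.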

      4. Conclusion

      Accurately and cost-effectively estimating kinship coefficients is essential in forensic investigations. With advances in MPS technologies, WGS is a promising approach. However, the costs associated with WGS and the uncertain genotypes that result from low-pass sequencing data remain a challenge. This study demonstrates that sequencing multiple samples together can be a cost-effective approach to WGS. Additionally, selecting sites known to vary in the population (e.g., GSA-KR SNPs) can improve estimates of kinship from low-pass sequencing data.

      Conflict of interest statement

      None.

      Acknowledgments

      This research was supported in part by award 2019-DU-BX-0046 awarded by the National Institute of Justice, Office of Justice Programs, U.S. Department of Justice and by internal funds from the Center for Human Identification.

      References

        • Yagasaki K.
        • et al.
        Practical forensic use of kinship determination using high-density SNP profiling based on a microarray platform, focusing on low-quantity DNA.
        Forensic Sci. Int. Genet. 2022; 102752
        • Kling D.
        • et al.
        Investigative genetic genealogy: current methods, knowledge and practice.
        Forensic Sci. Int. Genet. 2021; 52: 102474
        • Li H.
        • et al.
        Relationship estimation from whole-genome sequence data.
        PLoS Genet. 2014; 10: e1004144
        • Nielsen R.
        • et al.
        SNP calling, genotype calling, and sample allele frequency estimation from new-generation sequencing data.
        PLoS One. 2012; 7: e37558
        • Korneliussen T.S.
        • et al.
        NgsRelate: a software tool for estimating pairwise relatedness from next-generation sequencing data.
        Bioinformatics. 2015; : btv509
        • Nøhr A.K.
        • et al.
        NGSremix: a software tool for estimating pairwise relatedness between admixed individuals from next-generation sequencing data.
        G3 Genes|Genomes|Genet. 2021; : jkab174
        • Waples R.K.
        • et al.
        Allele frequency‐free inference of close familial relationships from genotypes or low‐depth sequencing data.
        Mol. Ecol. 2019; 28: 35-48
        • Li H.
        Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. 2013; https://doi.org/10.48550/ARXIV.1303.3997
        • Danecek P.
        • et al.
        Twelve years of SAMtools and BCFtools.
        GigaScience. 2021; 10: giab008. https://doi.org/10.1093/gigascience/giab008
        • McKenna A.
        • et al.
        The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.
        Genome Res. 2010; 20: 1297-1303
        • Amemiya H.M.
        • et al.
        The ENCODE Blacklist: identification of problematic regions of the genome.
        Sci. Rep. 2019; 9: 9354
        • Karczewski K.J.
        • et al.
        The mutational constraint spectrum quantified from variation in 141,456 humans.
        Nature. 2020; 581: 434-443
        • Arthur R.
        • et al.
        AKT: ancestry and kinship toolkit.
        Bioinformatics. 2017; 33: 142-144
        • Manichaikul A.
        • et al.
        Robust relationship inference in genome-wide association studies.
        Bioinformatics. 2010; 26: 2867-2873
        • Riman S.
        • et al.
        Understanding the characteristics of sequence-based single-source DNA profiles.
        Forensic Sci. Int. Genet. 2020; 44: 102192