
The ForAPP: Forensic Ancestry Prediction Pipeline for the interpretation of ancestry informative markers

Published: September 23, 2022. DOI: https://doi.org/10.1016/j.fsigss.2022.09.005

      Abstract

      Interpretation of ancestry informative markers (AIMs) in a forensic case can be time-consuming because several tools are used, each of which must be started separately. We developed the ForAPP, an open-source pipeline that runs offline, initiates multiple ancestry prediction analyses and summarizes the results in an interactive interface for the interpretation of autosomal AIMs.

      1. Introduction

      When the DNA profile of a trace that possibly originates from a suspect yields no match within the case or in a DNA database, ancestry prediction can be one of the tools used to narrow the pool of suspects. Y-chromosomal and mitochondrial DNA analyses inform on the paternal and maternal lineage, respectively; autosomal markers are important because they reflect the full ancestry, which can be ‘simple’ or admixed. Interpretation of autosomal AIMs is commonly conducted with several separate tools such as STRUCTURE [Pritchard et al., 2000] and Snipper [Pereira et al., 2012]. We developed the ForAPP (available on request), an open-source Python-based pipeline that runs offline, initiates multiple ancestry prediction tools with a single command and summarizes all results in an interactive interface.

      2. Methods

      The examples produced by the ForAPP in this manuscript are based on data for 46 InDels [Pereira et al., 2012] from 3110 reference samples of seven continents, obtained from 1000 Genomes [The 1000 Genomes Project Consortium, 2015], ForInDel [Carla S. et al., 2015] and HGDP [Cann et al., 2002] data. The example sample is from a Northern African individual belonging to a population that is not represented in the reference database.
      The ForAPP can handle three input formats: capillary electrophoresis (CE) data (GeneMarker tables), massively parallel sequencing (MPS) FDSTools [Hoogenboom et al., 2017] tables with allele markings in the ‘flags’ column, and simple comma-separated files containing marker names and allele calls. Reference data are provided in a format similar to that used for Snipper [Pereira et al., 2012] and can be adapted to the markers and reference data in use.
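The simplest of the three input formats can be illustrated with a short sketch. The exact column layout expected by the ForAPP is not specified here, so the header names `marker`, `allele1` and `allele2`, the function name `read_genotypes`, and the marker and allele codes are all assumptions made for illustration only.

```python
import csv
import io


def read_genotypes(handle):
    """Read rows of marker,allele1,allele2 into {marker: (allele_a, allele_b)}.

    The column names are hypothetical; adjust them to the real file layout.
    """
    genotypes = {}
    for row in csv.DictReader(handle):
        marker = row["marker"].strip()
        # Sort the two calls so a heterozygote is stored the same way
        # regardless of the order in which the alleles were written.
        alleles = tuple(sorted((row["allele1"].strip(), row["allele2"].strip())))
        genotypes[marker] = alleles
    return genotypes


# Made-up example input: I = insertion allele, D = deletion allele.
example = io.StringIO(
    "marker,allele1,allele2\n"
    "MARKER_A,I,D\n"
    "MARKER_B,D,D\n"
)
sample = read_genotypes(example)
print(sample)
```

Keeping the genotype as a sorted tuple per marker makes later comparisons with reference genotypes independent of allele order.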

      3. Results and discussion

      In addition to the conventional STRUCTURE [Pritchard et al., 2000] graphs, the 10 reference samples that most resemble the sample, based on Euclidean distance, are displayed. Furthermore, the STRUCTURE cluster distribution of the sample's three major clusters is displayed for each reference population to assess whether the sample fits within the variation of any of the populations (Fig. 1).
      Fig. 1. STRUCTURE results for a North African sample. Representation of a sample. (A) Cluster distribution of the sample and reference samples. (B) The sample is plotted on top of the distribution of the three major clusters in the population. (C) The 10 best matching reference samples are displayed.
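The selection of the best matching reference samples by Euclidean distance can be sketched as follows. The text does not state which values the distance is computed on; this sketch assumes each sample is represented by a vector of STRUCTURE cluster membership proportions. The function name `nearest_references`, the sample names and all numbers are hypothetical.

```python
import math


def nearest_references(sample_vec, references, k=10):
    """Return the k reference samples closest to sample_vec.

    references maps a sample name to a numeric vector of the same length
    as sample_vec. The result is a list of (name, distance) pairs, closest first.
    """
    scored = [
        (math.dist(sample_vec, vec), name)
        for name, vec in references.items()
    ]
    scored.sort()
    return [(name, dist) for dist, name in scored[:k]]


# Made-up cluster membership proportions for three reference samples.
refs = {
    "ref_A": (0.90, 0.05, 0.05),
    "ref_B": (0.50, 0.40, 0.10),
    "ref_C": (0.10, 0.10, 0.80),
}
query = (0.55, 0.35, 0.10)
top = nearest_references(query, refs, k=2)
print(top)
```

With `k=10` and a full reference panel, the same call would produce a list comparable to the one shown in panel (C).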
      For exploratory analysis, several interactive 3D PCA plots of three principal components are displayed, showing the relative distance between the sample and the reference samples (Fig. 2). For each additional plot, the least fitting reference population (ordered by LR) is removed to optimize the use of graph space.
      Fig. 2. 3D PCA visualization of the ForAPP. Two of the additional plots are displayed, in which the least fitting populations are not included.
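The stepwise removal of the least fitting population before each additional PCA plot can be sketched as below. This is only an illustration of the ordering logic, not the ForAPP implementation: the function name `plot_subsets`, the `min_populations` cut-off, the population labels and the log-likelihood values are all assumptions.

```python
def plot_subsets(log_likelihoods, min_populations=3):
    """Return the population lists for successive plots.

    log_likelihoods maps a population label to the log-likelihood of the
    sample under that population (higher means a better fit). Each
    successive list drops the currently least fitting population.
    """
    ranked = sorted(log_likelihoods, key=log_likelihoods.get, reverse=True)
    subsets = []
    while len(ranked) >= min_populations:
        subsets.append(list(ranked))
        ranked.pop()  # remove the least fitting population
    return subsets


# Made-up log10 likelihoods for four hypothetical reference populations.
fits = {"pop_A": -40.2, "pop_B": -31.5, "pop_C": -55.0, "pop_D": -30.1}
for populations in plot_subsets(fits):
    print(populations)
```

Each returned list would then be used to select the reference samples passed to the PCA for that plot.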
      Likelihood ratio (LR, relative to the top likelihood), support vector machine (SVM) and logistic regression (LogReg) algorithms are used to predict (non-admixed) ancestry. For SVM and LogReg, confusion matrices display the correct and incorrect predictions for the reference samples under the model. In addition, two scoring metrics (precision and recall) and the probability are displayed for each model (Fig. 3).
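An LR relative to the top likelihood can be illustrated with a minimal sketch. It assumes biallelic markers, Hardy-Weinberg proportions within each population, independence between markers, and a small frequency floor to avoid zero probabilities; none of these choices is confirmed as the ForAPP's own method. The function names, population labels and frequencies are hypothetical.

```python
import math


def log_likelihood(genotype, freqs, floor=1e-3):
    """log10 probability of a genotype given insertion-allele frequencies.

    genotype maps marker -> number of insertion alleles (0, 1 or 2).
    freqs maps marker -> insertion-allele frequency in one population.
    """
    total = 0.0
    for marker, count in genotype.items():
        # Clamp the frequency so that no genotype has probability zero.
        p = min(max(freqs[marker], floor), 1.0 - floor)
        q = 1.0 - p
        prob = {0: q * q, 1: 2 * p * q, 2: p * p}[count]
        total += math.log10(prob)
    return total


def lr_versus_top(genotype, population_freqs):
    """Return, per population, how many times more likely the top population is."""
    logs = {
        pop: log_likelihood(genotype, freqs)
        for pop, freqs in population_freqs.items()
    }
    best = max(logs.values())
    return {pop: 10 ** (best - value) for pop, value in logs.items()}


# Made-up genotype and allele frequencies for two hypothetical populations.
genotype = {"m1": 2, "m2": 0}
population_freqs = {
    "pop_A": {"m1": 0.8, "m2": 0.1},
    "pop_B": {"m1": 0.2, "m2": 0.5},
}
lrs = lr_versus_top(genotype, population_freqs)
print(lrs)
```

In this convention the best fitting population has an LR of 1, and larger values indicate populations under which the genotype is less likely.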
      Fig. 3. ForAPP likelihood and logistic regression results. The confusion matrix showing the accuracy of the model is displayed here only for LogReg.
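How precision and recall follow from a confusion matrix can be shown with a small worked example. The matrix orientation (rows are true populations, columns are predicted populations), the function name `precision_recall`, the labels and the counts are assumptions for illustration and are not taken from the ForAPP output.

```python
def precision_recall(confusion, labels):
    """Compute (precision, recall) per class from a confusion matrix.

    confusion[i][j] is the number of reference samples whose true class is
    labels[i] and whose predicted class is labels[j].
    """
    metrics = {}
    for k, label in enumerate(labels):
        true_positive = confusion[k][k]
        predicted = sum(row[k] for row in confusion)  # column total
        actual = sum(confusion[k])                    # row total
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        metrics[label] = (precision, recall)
    return metrics


# Made-up counts for two hypothetical populations.
labels = ["pop_A", "pop_B"]
confusion = [
    [8, 2],  # true pop_A: 8 predicted as pop_A, 2 as pop_B
    [1, 9],  # true pop_B: 1 predicted as pop_A, 9 as pop_B
]
scores = precision_recall(confusion, labels)
print(scores)
```

Here pop_A has a precision of 8/9 and a recall of 8/10: precision penalizes samples wrongly assigned to a population, while recall penalizes samples of that population that were assigned elsewhere.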

      4. Conclusion

      The ForAPP provides an easy way to apply commonly used ancestry prediction tools to data from AIMs of choice. The reference data carry two levels of ancestry ‘resolution’, so the analysis can be run at the continental (superpopulation) level or at the population level to provide additional support for a more detailed ancestry. The example shown illustrates a common challenge in ancestry interpretation: the population of origin is not represented in the reference data. While this is clearly visible in the STRUCTURE and PCA results, the LRs and the machine learning algorithms used are limited to predicting a single origin. As for any prediction, accuracy will depend on the available reference data.

      Conflict of interest statement

      The authors declare that there is no conflict of interest.

      References

        • Pritchard J.K., Stephens M., Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000; 155: 945-959
        • Pereira R., et al. Straightforward inference of ancestry and admixture proportions through ancestry-informative insertion deletion multiplexing. PLoS One. 2012; 7: e29684
        • The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015; 526: 68-74
        • Carla S., et al. Completion of a worldwide reference panel of samples for an ancestry informative Indel assay. Forensic Sci. Int. Genet. 2015; 17: 75-80
        • Cann H.M., et al. A human genome diversity cell line panel. Science. 2002; 296: 261-262
        • Hoogenboom J., van der Gaag K.J., et al. FDSTools: a software package for analysis of massively parallel sequencing data with the ability to recognise and correct STR stutter and other PCR or sequencing noise. Forensic Sci. Int. Genet. 2017; 27: 27-40