A multivariate statistical approach for the evaluation of the biogeographical ancestry information from traditional STRs

  • Eugenio Alladio
    Correspondence
    Corresponding author.
    Affiliations
    Reparto Carabinieri Investigazioni Scientifiche di Roma – Sezione di Biologia, Viale Tor di Quinto 119, 00191 - Rome, Italy

    Dipartimento di Chimica, Università degli Studi di Torino, via P. Giuria 7, 10125 Torino, Italy

    Centro Regionale Antidoping e di Tossicologia “A. Bertinaria”, Regione Gonzole 10/1, 10043 - Orbassano (TO), Italy
  • Chiara Della Rocca
    Affiliations
    Dipartimento di Biologia e Biotecnologie “Charles Darwin”, Sapienza Università di Roma, P.le Aldo Moro 5, 00185 - Rome, Italy
  • Fulvio Cruciani
    Affiliations
    Dipartimento di Biologia e Biotecnologie “Charles Darwin”, Sapienza Università di Roma, P.le Aldo Moro 5, 00185 - Rome, Italy
  • Marco Vincenti
    Affiliations
    Dipartimento di Biologia e Biotecnologie “Charles Darwin”, Sapienza Università di Roma, P.le Aldo Moro 5, 00185 - Rome, Italy

    Centro Regionale Antidoping e di Tossicologia “A. Bertinaria”, Regione Gonzole 10/1, 10043 - Orbassano (TO), Italy
  • Paolo Garofano
    Affiliations
    Centro Regionale Antidoping e di Tossicologia “A. Bertinaria”, Regione Gonzole 10/1, 10043 - Orbassano (TO), Italy
  • Andrea Berti
    Affiliations
    Reparto Carabinieri Investigazioni Scientifiche di Roma – Sezione di Biologia, Viale Tor di Quinto 119, 00191 - Rome, Italy
  • Filippo Barni
    Affiliations
    Reparto Carabinieri Investigazioni Scientifiche di Roma – Sezione di Biologia, Viale Tor di Quinto 119, 00191 - Rome, Italy
Published: September 30, 2019
DOI: https://doi.org/10.1016/j.fsigss.2019.09.097

      Abstract

      The capability to obtain biogeographic ancestry (BGA) information from DNA profiles has been widely explored in forensic genetics because of its potential usefulness in providing investigative clues. For law enforcement and security purposes, when genetic data have been obtained from unknown evidence, but no reference samples are available and DNA databases return no hits, it would be extremely useful at least to infer the ethno-geographic origin of the stain donor by examining traditional STR DNA profiles alone.
      Current protocols for ethnic origin estimation using STR profiles are usually based on Principal Component Analysis approaches and Bayesian methods. The present study provides an alternative approach that uses targeted multivariate data analysis strategies to estimate BGA information from unknown biological traces. A powerful multivariate technique, Partial Least Squares-Discriminant Analysis (PLS-DA), was applied to the NIST U.S. population datasets, which contain the allele frequencies of African-American, Asian, Caucasian and Hispanic individuals. The PLS-DA approach provided robust classifications, yielding models with high sensitivity and specificity that are capable of discriminating the populations on an ethnic basis. Finally, a real case was examined by extending the developed model to smaller and more geographically restricted populations, namely Albanian, Italian and Montenegrin individuals.

      1. Introduction

      STR markers are widely used in the interpretation of single-source and mixed DNA samples collected during crime scene investigations. Meanwhile, continuous technical improvements in forensic genetics related to so-called Next Generation Sequencing (NGS), or Massively Parallel Sequencing (MPS), have allowed analysts to focus on specific markers such as Single Nucleotide Polymorphisms (SNPs) in order to assess the geographical ancestry and ethnic origin of individuals, even from degraded samples [Phillips et al., 2016]. For this purpose, several approaches have been developed that evaluate SNPs or autosomal STR markers in terms of Bayesian statistics [Porras-Hurtado et al., 2013] and have proved capable of estimating the ethnic affiliation of unknown genetic profiles. However, multivariate classification techniques such as Partial Least Squares-Discriminant Analysis (PLS-DA) may provide further advantages when evaluating the ancestry of unknown subjects' genetic profiles, for instance the possibility of recognizing the markers or, possibly, the alleles that significantly distinguish two or more populations under comparison. This approach allows specific and sensitive discriminations among different ancestry groups to be performed simultaneously, even when the populations are geographically related.

      2. Material studied, methods, techniques

      2.1 DNA data

      The study was performed using the NIST U.S. population dataset as the reference database, which contains the allele frequencies of 29 autosomal STRs from selected U.S. African-American, Asian, Caucasian and Hispanic populations [Hill et al., 2013]. It is composed of 1036 subjects, divided into 342 African-American, 361 Caucasian, 236 Hispanic and 97 Asian individuals. Then, with the aim of evaluating a known DNA profile from real casework that had previously been genotyped in our laboratories, allele frequency databases containing the STR frequencies of individuals from Albania [Kubat et al., 2004], Italy [Berti et al., 2011] and Montenegro [Jeran et al., 2007] were used to estimate the BGA affiliation of an individual who, according to preliminary investigations, was Caucasian.

      2.2 Data analysis and statistics

      Multivariate models were built on simulated DNA profiles derived from the NIST U.S. population database. These genotypes were generated using the forensim package [Haned, 2011], available in the R (version 3.6.1) statistical environment [R Core Team, 2013]; this package simulates series of independent DNA profiles from the allele frequencies reported in reference population databases. As remarked by Haned [Haned, 2011], when simulating a DNA profile, two alleles are randomly selected at each locus according to the reported allele frequencies. Each of the four simulated datasets (i.e. one for each NIST U.S. population group) contained 500 DNA profiles, so that every affiliation group was represented by the same number of subjects. A comprehensive dataset of 2000 simulated DNA profiles was assembled and then randomly split into a training set and an evaluation set of 1600 and 400 individuals, respectively. The same approach was applied to the Albanian, Italian and Montenegrin databases by simulating 600 independent individuals for each tested population (the resulting 1800 subjects were divided into training and evaluation sets of 1500 and 300 individuals, respectively). Finally, PLS-DA [Ballabio and Consonni, 2013] models were calculated to assess their predictive capabilities in terms of BGA estimation by discriminating the individuals from the different populations. Models are built in accordance with the probability distributions of the most discriminant loci and alleles relative to the populations to which the subjects belong. Each simulated individual tested is then assigned to the BGA affiliation that shows the highest score in terms of probability.
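
      The simulation and splitting steps described above can be sketched as follows. This is a minimal illustration in plain Python rather than the forensim/R workflow used in the study; the population names, loci and allele frequencies are hypothetical placeholders (not NIST values), and the helper names `simulate_profile`, `simulate_dataset` and `split` are assumptions introduced only for this example.

```python
import random

# Hypothetical allele frequencies (illustrative values only, NOT the NIST data).
# Layout: {population: {locus: {allele: frequency}}}.
FREQS = {
    "pop_A": {"D3S1358": {"15": 0.3, "16": 0.5, "17": 0.2},
              "TH01": {"6": 0.4, "7": 0.1, "9.3": 0.5}},
    "pop_B": {"D3S1358": {"15": 0.6, "16": 0.2, "17": 0.2},
              "TH01": {"6": 0.1, "7": 0.6, "9.3": 0.3}},
    "pop_C": {"D3S1358": {"15": 0.2, "16": 0.2, "17": 0.6},
              "TH01": {"6": 0.3, "7": 0.3, "9.3": 0.4}},
    "pop_D": {"D3S1358": {"15": 0.4, "16": 0.4, "17": 0.2},
              "TH01": {"6": 0.2, "7": 0.2, "9.3": 0.6}},
}


def simulate_profile(freqs, rng):
    """Draw two alleles per locus at random, weighted by allele frequency."""
    return {
        locus: tuple(rng.choices(list(table), weights=list(table.values()), k=2))
        for locus, table in freqs.items()
    }


def simulate_dataset(freqs_by_pop, n_per_pop, rng):
    """Simulate the same number of labelled profiles for every population."""
    return [
        (pop, simulate_profile(freqs, rng))
        for pop, freqs in freqs_by_pop.items()
        for _ in range(n_per_pop)
    ]


def split(dataset, n_train, rng):
    """Randomly split a dataset into a training set and an evaluation set."""
    shuffled = list(dataset)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


rng = random.Random(42)  # fixed seed so the sketch is reproducible
dataset = simulate_dataset(FREQS, 500, rng)    # 4 groups x 500 profiles
train, evaluation = split(dataset, 1600, rng)  # 1600 training, 400 evaluation
print(len(dataset), len(train), len(evaluation))  # 2000 1600 400
```

      Simulating an equal number of profiles per group keeps the classes balanced before the random split, which mirrors the 4 x 500 design described in the text.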

      3. Results

      A PLS-DA model was evaluated for each of the NIST U.S. populations, as reported in Fig. 1. The simulated genetic profiles of the training set are reported on the left side of the Scores Plots (Fig. 1, a.-d.), while those in the evaluation set are shown on the right side. The discrimination models showed sensitivity and specificity values equal to 99% for the African-American individuals, 94% and 98% for the Asian subjects, and 94% and 97% for the Caucasian population, indicating highly predictive PLS-DA models. A robust, but less predictive, model was also obtained for the discrimination of the Hispanic BGA, with sensitivity and specificity values equal to 86% and 97%, respectively.
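
      For readers less familiar with these figures of merit, the sketch below shows how sensitivity and specificity can be computed for one class against all others from true and predicted labels. The function name `sensitivity_specificity` and the toy labels are hypothetical and serve only to illustrate the definitions; they are not the study's data.

```python
def sensitivity_specificity(true_labels, predicted_labels, target):
    """One-vs-rest sensitivity and specificity for the class `target`."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == target:
            if pred == target:
                tp += 1  # target profile correctly assigned to target
            else:
                fn += 1  # target profile assigned elsewhere
        else:
            if pred == target:
                fp += 1  # non-target profile wrongly assigned to target
            else:
                tn += 1  # non-target profile correctly kept out
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Toy labels (hypothetical): 4 profiles of class "X" and 6 of other classes.
true_labels = ["X", "X", "X", "X", "Y", "Y", "Y", "Z", "Z", "Z"]
predicted = ["X", "X", "X", "Y", "Y", "Y", "X", "Z", "Z", "Z"]

sens, spec = sensitivity_specificity(true_labels, predicted, "X")
print(round(sens, 2), round(spec, 2))  # 0.75 0.83
```

      Sensitivity is the fraction of profiles from the target group that are assigned to it, while specificity is the fraction of profiles from all other groups that are not assigned to it.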
      Fig. 1. PLS-DA Scores Plots relative to the models provided for the discrimination of (a.) African-American (red diamonds), (b.) Asian (green squares), (c.) Caucasian (blue triangles) and (d.) Hispanic (light blue triangles) ethnicities. (e.) PLS-DA Scores Plot (Albanian subjects are indicated by red diamonds, Italian subjects by green squares and Montenegrin individuals by blue triangles) relative to the model provided for the estimation of the affiliation of a Caucasian individual (marked by light blue triangles). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

      4. Discussions

      PLS-DA proved to be a useful tool for estimating the BGA origin of STR DNA profiles. The lower sensitivity obtained for the Hispanic population could be related to a higher rate of admixture among the collected individuals compared with the other tested BGA affiliations. Since the approach proved capable of distinguishing different populations, PLS-DA was then tested on more geographically restricted populations, such as Albanian, Italian and Montenegrin individuals, with the aim of predicting the BGA origin of an individual defined as Caucasian. A new PLS-DA model was calculated, once again using simulated STR DNA profiles, as reported in Fig. 1(e.). In this case, the individual under examination was assigned to the Montenegrin population; this result was later confirmed by law enforcement investigations, supporting the reliability of the multivariate approach.

      5. Conclusions

      The PLS-DA approach applied to simulated STR DNA datasets provided robust predictive models, yielding high sensitivity and specificity values for the discrimination of selected populations on the basis of BGA affiliation. A real case was also examined, with the aim of extending the tested model to smaller and more specific populations. This application might represent a powerful tool for law enforcement agencies whenever biological evidence is collected at a crime scene or recovered during mass-disaster and missing-persons investigations.

      Declaration of Competing Interest

      None.

      References

        • Phillips C., Santos C., Fondevila M., Carracedo Á., Lareu M.V.
          Inference of ancestry in forensic analysis I: autosomal ancestry-informative marker sets.
          Methods Mol. Biol. 2016; : 233-253. https://doi.org/10.1007/978-1-4939-3597-0_18
        • Porras-Hurtado L., Ruiz Y., Santos C., Phillips C., Carracedo Á., Lareu M.V.
          An overview of STRUCTURE: applications, parameter settings, and supporting software.
          Front. Genet. 2013; 4. https://doi.org/10.3389/fgene.2013.00098
        • Hill C.R., Duewer D.L., Kline M.C., Coble M.D., Butler J.M.
          U.S. Population data for 29 autosomal STR loci.
          Forensic Sci. Int. Genet. 2013; 7: e82-e83. https://doi.org/10.1016/j.fsigen.2012.12.004
        • Kubat M., Škavić J., Behluli I., Nuraj B., Bekteshi T., Behluli M., et al.
          Population genetics of the 15 AmpF/STR Identifiler loci in Kosovo Albanians.
          Int. J. Legal Med. 2004; 118: 115-118. https://doi.org/10.1007/s00414-004-0430-y
        • Berti A., Brisighelli F., Bosetti A., Pilli E.
          Allele frequencies of the new European Standard Set (ESS) loci in the Italian population.
          Forensic Sci. Int. Genet. 2011; 5: 548-549. https://doi.org/10.1016/j.fsigen.2010.01.006
        • Jeran N., Havas D., Ivanović V., Rudan P.
          Genetic diversity of 15 STR loci in a population of Montenegro.
          Coll. Antropol. 2007; 31: 847-852
        • Haned H.
          Forensim: An open-source initiative for the evaluation of statistical methods in forensic genetics.
          Forensic Sci. Int. Genet. 2011; 5: 265-268. https://doi.org/10.1016/j.fsigen.2010.03.017
        • R Core Team
          R: A Language and Environment for Statistical Computing.
          R Foundation for Statistical Computing, Vienna, Austria, 2013
        • Ballabio D., Consonni V.
          Classification tools in chemistry. Part 1: linear models. PLS-DA.
          Anal. Methods. 2013; 5: 3790-3798. https://doi.org/10.1039/c3ay40582f