Reparto Carabinieri Investigazioni Scientifiche di Roma – Sezione di Biologia, Viale Tor di Quinto 119, 00191 - Rome, Italy
Dipartimento di Chimica, Università degli Studi di Torino, via P. Giuria 7, 10125 Torino, Italy
Centro Regionale Antidoping e di Tossicologia “A. Bertinaria”, Regione Gonzole 10/1, 10043 - Orbassano (TO), Italy
Dipartimento di Biologia e Biotecnologie “Charles Darwin”, Sapienza Università di Roma, P.le Aldo Moro 5, 00185 - Rome, Italy
The possibility of obtaining biogeographic ancestry (BGA) information from DNA profiles has been widely explored in forensic genetics because of its potential usefulness in providing investigative clues. For law enforcement and security purposes, when genetic data have been obtained from unknown evidence but no reference samples are available and DNA databases return no hits, it would be extremely useful at least to infer the ethno-geographic origin of the stain donor simply by examining traditional STR DNA profiles.
Current protocols for estimating ethnic origin from STR profiles are usually based on Principal Component Analysis approaches and Bayesian methods. The present study provides an alternative approach that uses targeted multivariate data analysis strategies to estimate BGA information from unknown biological traces. A powerful multivariate technique, Partial Least Squares-Discriminant Analysis (PLS-DA), was applied to the NIST U.S. population datasets, which contain the allele frequencies of African-American, Asian, Caucasian and Hispanic individuals. The PLS-DA approach provided robust classifications, yielding models with high sensitivity and specificity that are capable of discriminating the populations on an ethnic basis. Finally, a real case was examined by extending the developed model to smaller and more geographically restricted populations, namely Albanian, Italian and Montenegrin individuals.
1. Introduction
STR markers are widely used to interpret single-source and mixed DNA samples collected during crime scene investigations. In parallel, continuous technical improvements in forensic genetics related to so-called Next Generation Sequencing (NGS) or Massively Parallel Sequencing (MPS) have allowed analysts to focus on specific markers such as Single Nucleotide Polymorphisms (SNPs) to obtain information about the geographical ancestry and ethnic origin of individuals, even from degraded samples [
], proving capable of estimating the ethnic affiliation of unknown genetic profiles. However, multivariate classification techniques such as Partial Least Squares-Discriminant Analysis (PLS-DA) may offer further advantages when evaluating the ancestry of unknown subjects’ genetic profiles, for instance the possibility of identifying the markers, or even the individual alleles, that significantly distinguish two or more populations under comparison. This approach allows specific and sensitive discrimination among different ancestry groups to be performed simultaneously, even when the populations are geographically related.
2. Material studied, methods, techniques
2.1 DNA data
The study used the NIST U.S. population dataset as the reference database; it contains the allele frequencies of 29 autosomal STRs from selected U.S. African-American, Asian, Caucasian and Hispanic populations [
]. It comprises 1036 subjects, divided into 342 African-American, 361 Caucasian, 236 Hispanic and 97 Asian individuals. Then, in order to evaluate a known DNA profile from a real case that had previously been genotyped in our laboratories, allele frequency databases containing the STR frequencies of individuals from Albania [
] were used to estimate the BGA affiliation of an individual identified as Caucasian by preliminary investigations.
2.2 Data analysis and statistics
Multivariate models were built on simulated DNA profiles derived from the NIST U.S. population database. These genotypes were generated using the forensim package [
]; this package simulates series of independent DNA profiles from the allele frequencies reported in reference population databases. As noted by Haned [
], when the DNA profiles are simulated, two alleles are randomly selected at each locus according to the reported allele frequencies. Each of the four simulated datasets (i.e. one for each NIST U.S. population group) contained 500 DNA profiles, so that all affiliation groups were equally represented. A comprehensive dataset of 2000 simulated DNA profiles was assembled and then randomly split into a training set and an evaluation set of 1600 and 400 individuals, respectively. The same approach was applied to the Albanian, Italian and Montenegrin databases by simulating 600 independent individuals for each tested population (the resulting 1800 subjects were divided into training and evaluation sets of 1500 and 300 individuals, respectively). Finally, PLS-DA [
] models were calculated to assess their predictive capabilities in terms of BGA estimation by discriminating the individuals from the different populations. Models are built in accordance with the probability distributions of the most discriminant loci and alleles in the populations to which the subjects belong. Each tested simulated individual is then assigned to the BGA affiliation that shows the highest score in terms of probability.
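The simulation and splitting steps described above can be sketched as follows. This is a minimal illustration in Python rather than the forensim package used in the study; the population labels, the two loci shown and all allele frequencies in `TOY_FREQS` are invented placeholders, not NIST values, and the helper names are hypothetical.

```python
import random

# Hypothetical allele frequencies (illustration only, NOT the NIST data):
# population label -> locus -> {allele: frequency}
TOY_FREQS = {
    "pop_A": {"D3S1358": {14: 0.1, 15: 0.3, 16: 0.6}, "TH01": {6: 0.5, 7: 0.2, 9.3: 0.3}},
    "pop_B": {"D3S1358": {14: 0.4, 15: 0.4, 16: 0.2}, "TH01": {6: 0.2, 7: 0.5, 9.3: 0.3}},
    "pop_C": {"D3S1358": {14: 0.2, 15: 0.6, 16: 0.2}, "TH01": {6: 0.3, 7: 0.3, 9.3: 0.4}},
    "pop_D": {"D3S1358": {14: 0.3, 15: 0.3, 16: 0.4}, "TH01": {6: 0.4, 7: 0.4, 9.3: 0.2}},
}


def simulate_profile(freqs, rng):
    """Draw one genotype: two alleles per locus, each sampled
    independently according to the population allele frequencies."""
    profile = {}
    for locus, allele_freqs in freqs.items():
        alleles = list(allele_freqs)
        weights = [allele_freqs[a] for a in alleles]
        # Sampling with replacement, so homozygous genotypes can occur.
        pair = rng.choices(alleles, weights=weights, k=2)
        profile[locus] = tuple(sorted(pair))
    return profile


def simulate_population(freqs, n, label, rng):
    """Simulate n independent labelled profiles for one population."""
    return [(simulate_profile(freqs, rng), label) for _ in range(n)]


def split(dataset, n_train, rng):
    """Randomly split a dataset into training and evaluation sets."""
    shuffled = dataset[:]
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
dataset = []
for label, freqs in TOY_FREQS.items():
    dataset += simulate_population(freqs, 500, label, rng)

# 4 populations x 500 profiles = 2000, split 1600 / 400 as in the text.
training_set, evaluation_set = split(dataset, 1600, rng)
```

Sampling the two alleles independently at each locus assumes Hardy-Weinberg proportions and independence between loci, which matches the description of independent profiles but is stated here as an assumption of the sketch.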
3. Results
A PLS-DA model was evaluated for each of the NIST U.S. populations, as reported in Fig. 1. The simulated genetic profiles of the training set are shown on the left side of the scores plots (Fig. 1, a.-d.), while those in the evaluation set are shown on the right side. The discrimination models showed sensitivity and specificity values of 99% for the African-American individuals, 94% and 98% for the Asian subjects, and 94% and 97% for the Caucasian population, indicating highly predictive PLS-DA models. A robust, but less predictive, model was also obtained for the discrimination of the Hispanic BGA, with sensitivity and specificity values of 86% and 97%, respectively.
Fig. 1. PLS-DA scores plots for the models built to discriminate (a.) African-American (red diamonds), (b.) Asian (green squares), (c.) Caucasian (blue triangles) and (d.) Hispanic (light blue triangles) ethnicities. (e.) PLS-DA scores plot for the model built to estimate the affiliation of a Caucasian individual (light blue triangles); Albanian subjects are indicated by red diamonds, Italian subjects by green squares and Montenegrin individuals by blue triangles. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
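For readers less familiar with these figures of merit, the sketch below shows how per-population sensitivity and specificity can be computed from true and predicted labels in a one-vs-rest fashion. The text does not state how the reported values were calculated, so the one-vs-rest scheme, the function name and the toy labels are assumptions made for illustration.

```python
def sensitivity_specificity(true_labels, predicted_labels, target):
    """One-vs-rest sensitivity and specificity for a single class.

    sensitivity = TP / (TP + FN): share of target-class subjects
                  correctly assigned to the target class.
    specificity = TN / (TN + FP): share of non-target subjects
                  correctly kept out of the target class.
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == target:
            if pred == target:
                tp += 1
            else:
                fn += 1
        else:
            if pred == target:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both target and non-target subjects are required")
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels (hypothetical, not results from the study).
truth = ["A", "A", "A", "A", "B", "B", "B", "B", "B", "B"]
preds = ["A", "A", "A", "B", "B", "B", "B", "B", "A", "B"]
sens_a, spec_a = sensitivity_specificity(truth, preds, "A")
```

In this toy case three of the four class-A subjects are recovered (sensitivity 0.75) and five of the six non-A subjects are correctly excluded (specificity 5/6).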
4. Discussion
PLS-DA proved to be a useful tool for estimating the BGA origin of STR DNA profiles. The lower sensitivity obtained for the Hispanic population could be related to a higher degree of admixture among the sampled individuals compared with the other tested BGA affiliations. Since this approach proved capable of distinguishing different populations, PLS-DA was then tested on more geographically restricted populations (Albanian, Italian and Montenegrin individuals), with the aim of predicting the BGA origin of an individual defined as Caucasian. A new PLS-DA model was calculated, again using simulated STR DNA profiles, as reported in Fig. 1(e.). In this case, the individual under examination was assigned to the Montenegrin population; this result was later confirmed by law enforcement investigations, supporting the reliability of the multivariate approach used.
5. Conclusions
The PLS-DA approach applied to simulated STR DNA datasets provided robust predictive models, yielding high sensitivity and specificity values for the discrimination of selected populations on the basis of BGA affiliation. A real case was also examined, with the aim of extending the tested model to smaller and more specific populations. This application might represent a powerful tool for law enforcement agencies whenever biological evidence is collected at a crime scene or recovered during mass-disaster and missing-person investigations.
Declaration of Competing Interest
None.
References
Phillips C., Santos C., Fondevila M., Carracedo Á., Lareu M.V. Inference of ancestry in forensic analysis I: autosomal ancestry-informative marker sets.