Research Article| Volume 7, ISSUE 1, P637-638, December 2019

Applying machine learning algorithms to a real forensic case to predict Y-SNP haplogroup based on Y-STR haplotype

  • Mengyuan Song 1
    Affiliations
    Institute of Forensic Medicine, West China School of Basic Medical Sciences & Forensic Medicine, Sichuan University, Chengdu 610041, China
  • Chenxi Zhao 1
    Affiliations
    College of Computer Science, Sichuan University, Chengdu 610065, China
  • Zheng Wang
    Affiliations
    Institute of Forensic Medicine, West China School of Basic Medical Sciences & Forensic Medicine, Sichuan University, Chengdu 610041, China
  • Yiping Hou
    Correspondence
    Corresponding author at: Institute of Forensic Medicine, West China School of Basic Medical Sciences & Forensic Medicine, Sichuan University, 3-16 Renmin South Road, Chengdu 610041, China.
    Affiliations
    Institute of Forensic Medicine, West China School of Basic Medical Sciences & Forensic Medicine, Sichuan University, Chengdu 610041, China
  • Author Footnotes
    1 The author contributed equally to this work and should be considered co-first author.
Published: October 18, 2019
DOI: https://doi.org/10.1016/j.fsigss.2019.10.120

      Highlights

      • The Y-chromosomal haplogroups of two samples were predicted.
      • High-resolution haplogroups were applied in the dataset.
      • Machine learning methods show high capability in the prediction.

      Abstract

      Y-chromosome single nucleotide polymorphisms (Y-SNPs) have a lower mutation rate than Y-chromosome short tandem repeats (Y-STRs) and are thus more informative for paternal lineage identification. Here we present a case involving the personal identification of an unidentified cadaver, in which machine learning methods were used to determine the Y-SNP haplogroup from the Y-STR haplotype. Two possible haplotypes from two different male lineages were found after searching national Y-STR databases. Six methods (k-Nearest Neighbor, Naive Bayesian Model, Logistic Regression, Support Vector Machine, Decision Tree, and Random Forest) were used to predict the haplogroup from the Y-STR haplotype. The two haplotypes were assigned to two different haplogroups, O2a2b1a2a1 and O2a2b1a2a1a3. The predictions were then verified by Y-SNP genotyping. This indicates that the mismatch between the two haplotypes may stem not from mutation but from different lineages. In this case, machine learning algorithms, especially Support Vector Machine and Random Forest, show potential for discriminating between different lineages.

      1. Introduction

      Y-chromosome markers are especially useful when a sample contains a mixture of male and female biological material. Analyzing the Y chromosome helps identify the paternal lineage and thus narrows down the investigation. Because of their lower mutation rate, Y-chromosome single nucleotide polymorphisms (Y-SNPs) are more informative in genealogy studies than Y-chromosome short tandem repeats (Y-STRs). Several methods exist for predicting the Y-chromosome haplogroup from the Y-STR haplotype [Athey, 2005; Athey, 2006; Schlecht et al., 2008]. Most of them are based on probabilistic models, such as Bayesian models. However, there are two problems: (1) the resolution of the haplogroup prediction is not high; (2) there is no real application in crime scene investigation. Here we use our own data set to train six machine learning algorithms for high-resolution prediction and then demonstrate their use in a real case to distinguish different haplogroups.

      2. Materials and methods

      We used data for 141 Y-SNPs and 27 Y-STRs genotyped in our laboratory in 3353 samples from Han, Hui, Tibetan and Uyghur individuals to train six models: k-Nearest Neighbor, Naive Bayesian Model, Logistic Regression, Support Vector Machine, Decision Tree, and Random Forest. We made predictions at different resolution levels, with the highest level reaching labels such as O1b1a1a1a1a1a1a1a1b. After cross-validation, we used these models to predict the haplogroups of the unknown samples from the crime scene investigation. The unknown sample is from an unidentified cadaver. Two possible haplotypes from two different male lineages were found after searching national Y-STR databases. They differ at only one locus, DYS449.
      After prediction, the unknown samples were also genotyped with the same panel of Y-SNPs to assign their haplogroups.
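
      The workflow described above (train a classifier on labeled haplotypes, cross-validate, then predict an unknown) can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only a k-nearest-neighbor vote, a mismatch count as the distance, and leave-one-out cross-validation, all of which are assumptions made here for illustration. The haplotypes, the four unnamed loci, and the labels in `TRAINING` are made-up toy values and do not come from the study's data set.

```python
from collections import Counter

# Hypothetical toy data: each haplotype is a tuple of repeat counts at
# four unnamed Y-STR loci, paired with a placeholder haplogroup label.
# None of these values come from the study's data set.
TRAINING = [
    ((14, 12, 29, 23), "O"),
    ((14, 12, 30, 23), "O"),
    ((15, 12, 29, 23), "O"),
    ((13, 14, 31, 25), "C"),
    ((13, 14, 32, 25), "C"),
    ((13, 15, 31, 24), "C"),
]


def mismatches(a, b):
    """Count the loci at which two haplotypes carry different alleles."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train, query, k=3):
    """Return the majority haplogroup among the k nearest haplotypes."""
    nearest = sorted(train, key=lambda item: mismatches(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(data, k=3):
    """Hold out each sample in turn and predict it from the remaining ones."""
    hits = 0
    for i, (haplotype, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, haplotype, k) == label
    return hits / len(data)
```

      Calling `leave_one_out_accuracy(TRAINING)` gives a cross-validated accuracy on the toy data, and `knn_predict(TRAINING, query)` assigns a label to a new haplotype. A mismatch count treats a one-step and a multi-step allele difference alike; a distance based on repeat-count differences would be an equally plausible choice.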

      3. Results and discussion

      The accuracy of the six models at the lowest and highest resolution levels is shown in Table 1. Random Forest is the most accurate model at both levels (0.984 at the lowest resolution; 0.770 at the highest resolution). Support Vector Machine ranks second at the highest resolution (0.695) and reaches 0.971 at the lowest resolution, where k-Nearest Neighbor is slightly more accurate (0.979). The predictions for the two unknown samples are shown in Table 2. At the lowest resolution, both haplotypes were predicted to belong to haplogroup O; at the highest resolution, five of the six models assigned them to O2a2b1a2a1 or O2a2b1a2a1a3, while the Naive Bayesian Model assigned both to O2a2b1a2a1a3b2b2.
      Table 1. The predicting accuracy of six models.
      Models                    Accuracy in the lowest resolution    Accuracy in the highest resolution
      k-Nearest Neighbor        0.979                                0.691
      Naive Bayesian Model      0.733                                0.492
      Logistic Regression       0.970                                0.669
      Support Vector Machine    0.971                                0.695
      Decision Tree             0.951                                0.613
      Random Forest             0.984                                0.770
      Table 2. The predicting results of two samples.
                                Predicted haplogroup (lowest resolution)    Predicted haplogroup (highest resolution)
      Models                    Haplotype 1           Haplotype 2           Haplotype 1         Haplotype 2
      k-Nearest Neighbor        O                     O                     O2a2b1a2a1a3        O2a2b1a2a1a3
      Naive Bayesian Model      O                     O                     O2a2b1a2a1a3b2b2    O2a2b1a2a1a3b2b2
      Logistic Regression       O                     O                     O2a2b1a2a1          O2a2b1a2a1
      Support Vector Machine    O                     O                     O2a2b1a2a1          O2a2b1a2a1a3
      Decision Tree             O                     O                     O2a2b1a2a1a3        O2a2b1a2a1a3
      Random Forest             O                     O                     O2a2b1a2a1          O2a2b1a2a1a3
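
      Table 2 reports each prediction at two resolution levels. The text does not state how the levels are derived, so the following is only a sketch of one plausible reading: a haplogroup name is split into alternating letter and number tokens, and a lower resolution corresponds to keeping fewer leading tokens. The helper names `levels`, `truncate` and `is_subclade` are hypothetical and introduced here for illustration.

```python
import re


def levels(haplogroup):
    """Split a haplogroup name into its alternating letter/number tokens."""
    return re.findall(r"[A-Za-z]|\d+", haplogroup)


def truncate(haplogroup, depth):
    """Keep only the first `depth` tokens, i.e. a lower-resolution label."""
    return "".join(levels(haplogroup)[:depth])


def is_subclade(child, parent):
    """True if `child` equals `parent` or is nested under it by name."""
    parent_levels = levels(parent)
    return levels(child)[:len(parent_levels)] == parent_levels
```

      Under this reading, `truncate("O2a2b1a2a1a3", 1)` returns `"O"`, the lowest-resolution label in Table 2, and the two highest-resolution labels reported for the case share their first ten tokens and differ only in the trailing `a3`.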
      Y-SNP genotyping subsequently confirmed that the two haplotypes come from two paternal lineages, O2a2b1a2a1 and O2a2b1a2a1a3, respectively. Further investigation confirmed that the two male individuals are uncle and nephew. The genotyping results agree with the predictions of Support Vector Machine and Random Forest.
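
      The concordance statement can be checked directly against Table 2. The short sketch below transcribes the highest-resolution predictions from Table 2 and the genotyped haplogroups from the text, then lists the models whose predictions match both haplotypes. It assumes that the genotyped haplogroups are given in the same order as Haplotype 1 and Haplotype 2; the function name `concordant_models` is hypothetical.

```python
# Highest-resolution predictions transcribed from Table 2,
# as (haplotype 1, haplotype 2) for each model.
PREDICTIONS = {
    "k-Nearest Neighbor": ("O2a2b1a2a1a3", "O2a2b1a2a1a3"),
    "Naive Bayesian Model": ("O2a2b1a2a1a3b2b2", "O2a2b1a2a1a3b2b2"),
    "Logistic Regression": ("O2a2b1a2a1", "O2a2b1a2a1"),
    "Support Vector Machine": ("O2a2b1a2a1", "O2a2b1a2a1a3"),
    "Decision Tree": ("O2a2b1a2a1a3", "O2a2b1a2a1a3"),
    "Random Forest": ("O2a2b1a2a1", "O2a2b1a2a1a3"),
}

# Haplogroups assigned by Y-SNP genotyping, as reported in the text.
# Assumption: listed in the order haplotype 1, haplotype 2.
GENOTYPED = ("O2a2b1a2a1", "O2a2b1a2a1a3")


def concordant_models(predictions, truth):
    """Return the models whose predictions match the truth for both haplotypes."""
    return [model for model, pair in predictions.items() if pair == truth]


print(concordant_models(PREDICTIONS, GENOTYPED))
```

      Every other model predicted one of the two haplotypes correctly except the Naive Bayesian Model, whose label matches neither genotyped haplogroup.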

      4. Conclusion

      Our study indicates that the mismatch between the two haplotypes may stem not from mutation but from different lineages. In this case, machine learning algorithms, especially Support Vector Machine and Random Forest, show potential for discriminating between different lineages.

      Declaration of Competing Interest

      The authors declare that they have no conflict of interest.

      Acknowledgments

      This study was supported by grants from the National Natural Science Foundation of China (No. 818715320, No. 81601648). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

      References

        • Athey T.W.
        Haplogroup prediction from Y-STR values using an allele-frequency approach.
        J. Genet. Geneal. 2005; 1: 1-7
        • Athey T.W.
        Haplogroup prediction from Y-STR values using a Bayesian-allele-frequency approach.
        J. Genet. Geneal. 2006; 2: 34-39
        • Schlecht J., Kaplan M.E., Barnard K., Karafet T., Hammer M.F., Merchant N.C.
        Machine-learning approaches for classifying haplogroup from Y chromosome STR data.
        PLoS Comput. Biol. 2008; 4: e1000093