Introduction of the python script MHinNGS for analysis of microhaplotypes

  • Carina G. Jønck
    Affiliations
    Section of Forensic Genetics, Department of Forensic Medicine, University of Copenhagen, Denmark
  • Claus Børsting
    Correspondence
    Correspondence to: Section of Forensic Genetics, Department of Forensic Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, Frederik V's Vej 11, DK-2100 Copenhagen, Denmark.
    Affiliations
    Section of Forensic Genetics, Department of Forensic Medicine, University of Copenhagen, Denmark
Published: September 29, 2022
DOI: https://doi.org/10.1016/j.fsigss.2022.09.029

      Abstract

      MHinNGS is a Python application developed for analysis of microhaplotypes (MHs) in single-end sequencing data. MHinNGS analyses reads in standard formats and stores each sequence in a bin, one bin for each MH as defined by the two flanking sequences. MHinNGS requires a reference genome and a configuration file with information about each locus. Five mandatory and 15 optional criteria defined in the configuration file allow detailed locus-specific analyses of the MH loci. The program 1) removes noise, 2) identifies and names alleles, 3) tests the genotypes, and 4) tests unique sequences not identified as noise or alleles. MHinNGS produces a result file in which every unique sequence that passed the noise filter is presented with MH allele, read depth, warning flags based on the genotyping criteria, sequence, heterozygote balance, and MH name. Furthermore, variation in other parts of the fragment that is not defined as SNPs in the MH, linked variants, or rare SNPs is listed in a separate column of the result file.

      1. Introduction

      Microhaplotypes (MHs) consist of two or more polymorphic loci (typically SNPs or small indels) within a short stretch of DNA (typically 2–300 nucleotides) [1]. The relatively short distances between the variants allow for efficient PCR amplification and sequencing of the entire amplicon, which makes PCR-NGS assays targeting MH loci highly sensitive and potentially interesting for forensic genetic applications [2,3].
      MHs have three important advantages compared to the standard STR loci used in forensic genetics: 1) Amplification of MHs does not generate stutter artefacts, which complicate data analysis of mixture samples [4]. 2) The mutation rates of MHs are 4–6 orders of magnitude lower than the mutation rates of STRs [1], which is particularly important for relationship testing. 3) The amplicon lengths of the different MH alleles are the same. This prevents NGS read count variation due to differently sized alleles, which is observed for most STRs [5–7] and may be a problem in the analysis of highly degraded samples.

      2. Materials and methods

      MHinNGS is a freely available Python script (https://hub.docker.com/r/bioinformatician/mhinngs) developed for analysis of MHs in single-end sequencing data. MHinNGS is built upon the program STRinNGS v2.0 [8], which is used for analysis of STR sequences, and the two programs have many similar features. MHinNGS needs three inputs: 1) a file or folder containing the reads (FASTQ, FASTA, BAM, SAM, or CRAM format), 2) a reference genome in FASTA format, and 3) a configuration file containing information about each locus. The configuration file has five mandatory elements and 15 optional criteria (Table 1).
      Table 1. Criteria and settings.

      Setting                       Example                          Possible flag(s)
      Mandatory information
      locus name                    mh11KK-180 (a)
      chromosome                    chr11
      start                         1669536
      stop                          1669769
      mh_info                       rs12802112:1669561:GAT*AGA:AG    "Unexpected microhaplotype"
                                    rs12360952:1669657:GG*GAC:TC
                                    rs28631755:1669681:GGG*TTT:AC
                                    rs4752778:1669720:ATG*TAA:CT
                                    rs7112918:1669739:CCA*GAA:CT
                                    rs4752777:1669754:TGA*GCA:CG
      Optional criteria
      ignore_pos                    1669594.1 G,1669676
      rare_snp                      rs117851656:1669542:T            "Rare SNP"
      linked_allele                 1669540:ACCTTG:A                 "Linked allele not linked"
      Default criteria
      flank_up_length               -15
      flank_down_length             -15
      mism_up                       1
      mism_down                     1
      noise_filter (>=)             0.01
      min_reads (>=)                100                              "Too few reads" / "Locus Dropout"
      max_num_unique                4                                "Many unique sequences"
      min_frac_genotype             0.8                              "Three alleles" / "More than three alleles"
      hetero_balance                0.25;0.75                        "Heterozygote imbalance"
      max_reads_unique_not_called   0.1                              "Sequence with many reads not called"
      min_unex                      15                               "Unexpected sequence detected"
      slide                         2

      (a) Two additional SNPs (rs12360952 and rs4752778) were included in the original MH defined by Kidd and co-workers [9].
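      The structure of an 'mh_info' entry in Table 1 can be illustrated with a short parser. This is a minimal sketch and not code from MHinNGS: the class and function names are hypothetical, and it assumes that the four colon-separated fields are rs number, genome position, surrounding nucleotides (with '*' marking the variant position), and possible alleles written as single nucleotides. How indel alleles are encoded is not shown in the table, so they are not handled here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MhVariant:
    """One variant of a microhaplotype (hypothetical container)."""
    rs_id: str
    position: int
    left_context: str    # nucleotides immediately upstream of the variant
    right_context: str   # nucleotides immediately downstream of the variant
    alleles: Tuple[str, ...]


def parse_mh_info(entry: str) -> MhVariant:
    """Parse an entry such as 'rs12802112:1669561:GAT*AGA:AG'."""
    rs_id, position, context, alleles = entry.strip().split(":")
    left, right = context.split("*")
    # Assumption: each character of the last field is one possible SNP allele.
    return MhVariant(rs_id, int(position), left, right, tuple(alleles))


variant = parse_mh_info("rs12802112:1669561:GAT*AGA:AG")
print(variant.position, variant.left_context, variant.right_context, variant.alleles)
```

      Note that the surrounding nucleotides need not be symmetric: the entry for rs12360952 in Table 1 has two nucleotides upstream and three downstream.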
      The MHinNGS output consists of three files: 1) a log file containing information about the run, such as program version, input files, and parameter settings, 2) a result file in csv format with filtered data and all comments, and 3) a file named raw_results in csv format that contains all data including noise sequences, but without allele name, comments, and heterozygote balance.

      3. Results and conclusions

      In short, MHinNGS collects and stores sequences in bins, one bin for each MH, according to the two flanking sequences (‘flank_up_length’ and ‘flank_down_length’ in Table 1). Next, the program removes noise, identifies and names alleles, tests the genotype, and tests unique sequences that were not identified as either noise or alleles (Fig. 1). In addition to the criteria defined in STRinNGS [8], four criteria have been added to the MHinNGS configuration file: ‘mh_info’, ‘slide’, ‘linked_allele’ and ‘rare_snp’ (Table 1). Each variant (SNP or indel) of the MH is defined in the configuration file (‘mh_info’ in Table 1) with rs number (if known), genome position, surrounding nucleotides, and possible alleles. The variant is identified by searching for the nucleotides surrounding the variant position. The surrounding nucleotides must be an exact match. If a match is not found, the program slides one nucleotide to the left or right and tries again, until the surrounding nucleotides match or the slide maximum (‘slide’ in Table 1) is reached. MHinNGS also searches for additional variants between the start and stop positions (Table 1). If a variant is identified, the position and base call are indicated in the result file (Supplementary Tables 1 and 2), but the variant is not included in the MH name.
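      The sliding search for a variant can be sketched as follows. This is an illustration under stated assumptions, not the MHinNGS implementation: the function name and its arguments are hypothetical, only single-nucleotide variants are handled, the expected position is given as a 0-based index into the read, and the order in which left and right offsets are tried is a guess.

```python
from typing import Optional, Tuple


def call_variant(read: str, expected_index: int, left: str, right: str,
                 max_slide: int = 2) -> Optional[Tuple[str, int]]:
    """Return (base, offset) if the surrounding nucleotides match exactly
    within max_slide positions of expected_index, otherwise None."""
    # Try the expected position first, then -1, +1, -2, +2, ...
    offsets = [0]
    for step in range(1, max_slide + 1):
        offsets.extend((-step, step))

    for offset in offsets:
        i = expected_index + offset
        start = i - len(left)
        end = i + 1 + len(right)
        if start < 0 or end > len(read):
            continue  # the search window would fall outside the read
        if read[start:i] == left and read[i + 1:end] == right:
            return read[i], offset
    return None


# The variant base 'C' sits between the contexts 'GAT' and 'AGA'.
print(call_variant("TTGATCAGACC", 5, "GAT", "AGA"))
# One extra upstream nucleotide shifts the variant by one position.
print(call_variant("ATTGATCAGACC", 5, "GAT", "AGA"))
```

      In this sketch, a shift larger than the slide maximum returns None, which would correspond to a variant that cannot be called at that position.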
      Fig. 1. Genotype calling with MHinNGS. There are three groups of reads for a locus as indicated on the left. ‘Total reads’ are all reads identified via the upstream and downstream flank. The ‘Reads for genotype call’ are all the reads that are left after noise reads have been removed. The ‘Genotype reads’ are the reads that make up the genotype. Thresholds and possible flags for each group of reads are indicated on the right.
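      A few of the thresholds in Table 1 can be illustrated with a small sketch of the noise filter and some of the flags. The exact definitions used by MHinNGS are not given in the text, so the following are assumptions: the noise filter is applied to the fraction of total reads at the locus, ‘min_reads’ is compared with the total reads, “Locus Dropout” means zero reads, and heterozygote balance is the read depth of one allele divided by the summed depth of both alleles. All function names are hypothetical.

```python
from typing import Dict, List, Tuple


def filter_noise(counts: Dict[str, int], noise_filter: float = 0.01) -> Dict[str, int]:
    """Keep unique sequences whose share of the total reads is >= noise_filter."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {seq: n for seq, n in counts.items() if n / total >= noise_filter}


def read_depth_flags(counts: Dict[str, int], min_reads: int = 100,
                     max_num_unique: int = 4,
                     noise_filter: float = 0.01) -> Tuple[Dict[str, int], List[str]]:
    """Return the sequences that pass the noise filter and any flags raised."""
    flags = []
    total = sum(counts.values())
    if total == 0:
        flags.append("Locus Dropout")        # assumption: no reads at all
    elif total < min_reads:
        flags.append("Too few reads")
    kept = filter_noise(counts, noise_filter)
    if len(kept) > max_num_unique:
        flags.append("Many unique sequences")
    return kept, flags


def balance_flag(reads_a: int, reads_b: int,
                 low: float = 0.25, high: float = 0.75) -> List[str]:
    """Flag a heterozygous genotype whose balance falls outside [low, high]."""
    balance = reads_a / (reads_a + reads_b)  # assumed definition
    return [] if low <= balance <= high else ["Heterozygote imbalance"]


counts = {"ACCTTG": 480, "GTCCTC": 500, "ACCTTC": 5}
kept, flags = read_depth_flags(counts)
print(sorted(kept), flags, balance_flag(480, 500))
```

      The default values mirror the examples in Table 1 (0.01, 100, 4, and 0.25;0.75); the remaining criteria, such as ‘min_frac_genotype’, are left out because their calculation is not described here.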
      In the configuration file, it is possible to ignore specific positions (‘ignore_pos’ in Table 1) with frequent errors that generate multiple unique sequences (example in Supplementary Table 1). Furthermore, it is possible to define rare variants (‘rare_snp’ in Table 1) that are not part of the MH, with rs number, genome position, and alternative allele. If the alternative allele is detected, a warning flag (“Rare SNP”) is raised in the comment column of the result file. However, the SNP is not included in the MH name.
      Linked alleles may be defined in the configuration file (‘linked_allele’ in Table 1) with genome position, the MH allele that the variant is linked to, and the variant allele. If the variant allele is detected and the MH allele is identical to the expected, linked MH allele, the position and base call of the variant allele are not shown in the result file. If the variant allele is detected on another haplotype, the flag “Linked allele not linked” is shown in the comment column of the result file.
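      The linked-allele rule can be expressed in a few lines. This is a sketch only, with hypothetical function names, assuming that a ‘linked_allele’ entry such as the one in Table 1 holds genome position, linked MH allele, and variant allele, in that order, and that the base observed at that position and the MH allele of the same sequence are already known.

```python
from typing import Optional, Tuple


def parse_linked_allele(entry: str) -> Tuple[int, str, str]:
    """Split an entry such as '1669540:ACCTTG:A' into its three fields."""
    position, linked_mh_allele, variant_allele = entry.strip().split(":")
    return int(position), linked_mh_allele, variant_allele


def linked_allele_comment(entry: str, observed_base: str,
                          observed_mh_allele: str) -> Optional[str]:
    """Return the warning flag, or None when nothing needs to be reported."""
    _, linked_mh_allele, variant_allele = parse_linked_allele(entry)
    if observed_base != variant_allele:
        return None  # the variant allele was not detected in this sequence
    if observed_mh_allele == linked_mh_allele:
        return None  # expected linkage: the variant is not shown
    return "Linked allele not linked"


print(linked_allele_comment("1669540:ACCTTG:A", "A", "ACCTTG"))
print(linked_allele_comment("1669540:ACCTTG:A", "A", "GTCCTC"))
```

      The second call uses a made-up MH allele (GTCCTC) to show the case in which the variant allele appears on an unexpected haplotype.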
      In conclusion, MHinNGS is freely available MH analysis software that provides the user with maximum flexibility and complete control of the analysis process.

      Conflict of interest

      None.

      Appendix A. Supplementary material

      References

        [1] Kidd K.K., Pakstis A.J., Speed W.C., Lagacé R., Chang J., Wootton S., Haigh E., Kidd J.R. Current sequencing technology makes microhaplotypes a powerful new type of genetic marker for forensics. Forensic Sci. Int. Genet. 2014; 12: 215-224
        [2] Oldoni F., Kidd K.K., Podini D. Microhaplotypes in forensic genetics. Forensic Sci. Int. Genet. 2019; 38: 54-69
        [3] Børsting C., Morling N. Next generation sequencing and its applications in forensic genetics. Forensic Sci. Int. Genet. 2015; 18: 78-89
        [4] Gill P., Haned H., Bleka O., Hansson O., Dørum G., Egeland T. Genotyping and interpretation of STR-DNA: Low-template, mixtures and database matches-twenty years of research and development. Forensic Sci. Int. Genet. 2015; 18: 100-117
        [5] Fordyce S.L., Mogensen H.S., Børsting C., Lagacé R.E., Chang C.W., Rajagopalan N., Morling N. Second-generation sequencing of forensic STRs using the Ion Torrent™ HID STR 10-plex and the Ion PGM™. Forensic Sci. Int. Genet. 2015; 14: 132-140
        [6] Hussing C., Huber C., Bytyci R., Mogensen H.S., Morling N., Børsting C. Sequencing of 231 forensic genetic markers using the MiSeq FGx™ forensic genomics system - an evaluation of the assay and software. Forensic Sci. Res. 2018; 3: 111-123
        [7] Simayijiang H., Morling N., Børsting C. Sequencing of human identification markers in an Uyghur population using the MiSeq FGx™ Forensic Genomics System. Forensic Sci. Res. 2022; 7: 154-162
        [8] Jønck C.G., Qian X., Simayijiang H., Børsting C. STRinNGS v2.0: improved tool for analysis and reporting of STR sequencing data. Forensic Sci. Int. Genet. 2020; 48: 102331
        [9] Kidd K.K., Speed W.C., Pakstis A.J., Podini D.S., Lagacé R., Chang J., Wootton S., Haigh E., Soundararajan U. Evaluating 130 microhaplotypes across a global set of 83 populations. Forensic Sci. Int. Genet. 2017; 29: 29-37