Orogen: Advanced relationship predictions for genetic genealogy

Published: September 28, 2022. DOI: https://doi.org/10.1016/j.fsigss.2022.09.011

      Abstract

      Orogen advances the science of relationship prediction by using Ped-sim, which improves upon previous models by incorporating crossover interference and sex-specific genetic maps. The Orogen tool provides accurate relationship predictions for a wide range of relationship types. It properly differentiates between close relatives at 23andMe, which is a newly available functionality for standalone tools. It provides new granularity of close relationships by showing the differences between paternal and maternal sides and in-group relationship types.

      Keywords

      1. Introduction

      Relationship prediction is important for determining kinship with DNA matches by showing all possible relationships and ranking them from most to least probable. Direct-to-consumer DNA testing services have used simulations to provide customers with relationship estimations for several years [Ball et al., 2016]. Previously, most simulations ignored crossover interference and relied on sex-averaged genetic maps. Results from Ped-sim [Caballero et al., 2019], a genetic model that can utilize sex-specific genetic maps [Bhérer et al., 2017] and simulates crossover interference [Campbell et al., 2015], show that these features have substantial effects on the shared DNA between relatives.
      Population weights are also important. A person has many more distant cousins than close cousins, so when a person looks at their DNA matches they are seeing a sample that is biased towards distant cousins. Population weights overcome that bias, resulting in much higher probabilities for more distant cousins. Henn et al. [2012] showed that the number of cousins that a person has will grow exponentially and indefinitely with increasing degree of cousinship. This model ignores pedigree collapse and thus can only be used to a certain degree of cousinship. Population models have also so far ignored cousins once removed, who along with half cousins will make up a large portion of a person’s DNA matches.
      Here I introduce Orogen relationship predictions, which incorporate the advanced modeling of Ped-sim and population weights that approximate the average number of cousins a person has in order to overcome the bias towards distant DNA matches.

      2. Methodology

      There are five main steps necessary for producing accurate relationship predictions.

      2.1 Simulating shared DNA between relatives

      Simulations were run in Ped-sim with 500,000 trials for each relative type, with two exceptions: parent/child relationships were not simulated, and maternal paternal relationships were used in place of paternal maternal relationships, since the shared DNA has the same properties. I used sex-specific genetic maps and crossover interference for all simulations. Half relationships more distant than Group 3 and cousins two times or more removed were not included.
      I created a conversion factor based on the total genetic map length (3346 cMs) of Bhérer et al. [2017], used by Ped-sim, and the different genetic lengths at 23andMe (3538 cMs) and AncestryDNA (3489 cMs) (data collected by the author). For simplicity, I applied this conversion, as well as low-cM cutoffs in accordance with the thresholds used at 23andMe and AncestryDNA [ISOGG Wiki, 2022], to each segment in the Ped-sim output files. For percentages, which are reported by 23andMe including X-DNA, I kept X-DNA in totals, but excluded it for cM input boxes. The genetic length of X-DNA is 176.3 cMs in the Bhérer et al. [2017] map and ∼182 cMs at 23andMe for one chromosome copy (data collected by the author). I included options for two female testers or for two male testers, which require different simulation parameters.
      For parent/child relationships, I generated a normal distribution of shared cMs as observed after genotyping errors. I did this by approximately matching the distribution to Fig. 5.3 of the 2020 AncestryDNA Matching White Paper [DNA Painter, 2022]. I then kept only the lower half of this distribution, which has the genetic map length of the respective companies as both a maximum and a mode.
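      The segment conversion described above can be sketched as follows. This is a minimal illustration, not the author's code: the function name, the toy segment lengths, the 8 cM cutoff, and the choice to scale before applying the cutoff are all assumptions. Only the three map lengths come from the text.

```python
# Total autosomal genetic map lengths in cM, as stated in the text.
BHERER_CM = 3346.0
COMPANY_CM = {"23andMe": 3538.0, "AncestryDNA": 3489.0}


def convert_segments(segments_cm, company, min_segment_cm):
    """Scale Ped-sim segment lengths to a company's map, then drop short segments.

    min_segment_cm is a hypothetical placeholder; the real cutoffs are
    company-specific. Scaling before filtering is also an assumption.
    """
    factor = COMPANY_CM[company] / BHERER_CM
    scaled = [length * factor for length in segments_cm]
    return [length for length in scaled if length >= min_segment_cm]


# Toy segments (cM on the Bherer et al. map) for one simulated pair.
segments = [5.0, 12.0, 40.0]
kept = convert_segments(segments, "AncestryDNA", min_segment_cm=8.0)
total_cm = sum(kept)  # total shared cM after conversion and cutoff
```

      In this toy case the 5 cM segment falls below the cutoff even after scaling, so only two segments contribute to the total.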

      2.2 Finding counts for each relationship type in bins

      I established 1 cM bins that are open on the left, centered on integers, and closed on the right, placing the total counts for each relationship type into each bin. In order to represent each group equally, I divided the counts for each relationship type by the number of types in that group. For example, Group 2 comprises six different types, including paternal half-siblings, maternal half-siblings, paternal grandparent/grandchild, etc. Each of these six types represents 500,000 pairs of individuals. I divided the counts for all relationship types in Group 2 so that it would have equal representation compared to Groups 5–16, which each comprise 500,000 cousin pairs.
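      As a rough sketch of this binning step, a bin centered on the integer k can be taken to cover the interval (k − 0.5, k + 0.5]. The function names and the toy totals below are hypothetical; the division by the number of types in the group follows the description above.

```python
import math
from collections import Counter


def bin_index(total_cm):
    """Return the integer k such that k - 0.5 < total_cm <= k + 0.5."""
    return math.ceil(total_cm - 0.5)


def binned_counts(totals_cm, types_in_group):
    """Count simulated pairs per 1 cM bin, divided by the group's number of types."""
    counts = Counter(bin_index(x) for x in totals_cm)
    return {k: n / types_in_group for k, n in counts.items()}


# Toy totals for one Group 2 relationship type (six types in Group 2 per the text).
example = binned_counts([1700.2, 1700.5, 1700.6, 1701.4], types_in_group=6)
```

      A value of exactly 1700.5 cMs lands in the bin centered on 1700, because each bin is closed on the right.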

      2.3 Smoothing the counts across bins for each relationship type

      Even with 500,000 trials, plots of the counts for a given relationship type across bins will not be smooth. To avoid probability curves that are non-monotonic, I smoothed the counts for each relationship type across bins by applying moving window averages iteratively, with the window size usually decreasing on each pass.
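      One way to picture this smoothing is the sketch below. The window sizes (9, 5, 3), the truncation of the window at the ends of the list, and the toy counts are assumptions made for illustration; the text does not state the actual window sizes or the edge handling.

```python
def moving_average(values, window):
    """Centered moving average; the window is truncated at the ends of the list."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def smooth(counts, windows=(9, 5, 3)):
    """Apply moving averages repeatedly, shrinking the window on each pass."""
    for window in windows:
        counts = moving_average(counts, window)
    return counts


# Toy per-bin counts for one relationship type.
noisy = [0, 4, 1, 7, 3, 9, 6, 12, 8, 5, 6, 2, 3, 0, 1]
smoothed = smooth(noisy)
```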

      2.4 Applying population weights

      I used a simple population model and assumed, like Henn et al. [2012], a constant survival rate (SR) of 2.5 children per family. For population weights in this study I ignored half relationships, cousins twice removed or more, and descendants. The formula for the number of nth cousins a person has, on average, under these assumptions is as follows: $(SR-1) \times 2^{g-1} \times SR^{g-1}$, where $g$ is the number of generations back to the common ancestor and $n = g-1$. The formula for cousins once removed is as follows: $(SR-1) \times 2^{g-1} \times SR^{g} + (SR-1) \times 2^{g} \times SR^{g-1}$, where the first addend is for the younger generation and the second addend is for the older generation. In order to have smoother growth across groups, I used only the average of the two types of cousins once removed. For every bin for a given relationship type, I then multiplied the count by the group population weights described above.
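      A worked example may help. The sketch below reads the nth-cousin formula as (SR − 1) × 2^(g−1) × SR^(g−1) and the once-removed formula as (SR − 1) × 2^(g−1) × SR^g + (SR − 1) × 2^g × SR^(g−1), and evaluates both for first cousins (g = 2) with SR = 2.5. The function names are hypothetical.

```python
SR = 2.5  # children per family, as assumed in the text


def nth_cousins(g, sr=SR):
    """Expected number of (g-1)th cousins; g = generations back to the common ancestor."""
    return (sr - 1) * 2 ** (g - 1) * sr ** (g - 1)


def cousins_once_removed(g, sr=SR):
    """Return (younger-generation, older-generation) counts of (g-1)th cousins once removed."""
    younger = (sr - 1) * 2 ** (g - 1) * sr ** g
    older = (sr - 1) * 2 ** g * sr ** (g - 1)
    return younger, older


first_cousins = nth_cousins(2)            # 1.5 * 2 * 2.5 = 7.5
younger, older = cousins_once_removed(2)  # 18.75 and 15.0
weight_1c1r = (younger + older) / 2       # average of the two types: 16.875
```

      The first-cousin value can be checked by hand: two parents, each with SR − 1 = 1.5 siblings, each of whom has 2.5 children, gives 7.5 first cousins.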

      2.5 Calculating probabilities for each relationship type

      Finally, I calculated the relative probabilities for each relationship type by dividing its count by the total count of all relationship types in each bin. For the percentage input boxes, I divided the cM amounts by the total genetic map, including X-DNA, to obtain a percentage.
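      The normalization in this step can be sketched as follows, assuming the weighted counts are stored as a mapping from relationship type to per-bin counts. That data layout, the function name, and the toy numbers are illustrative assumptions.

```python
def relative_probabilities(weighted_counts):
    """Normalize per-bin counts so the probabilities in each bin sum to 1.

    weighted_counts maps relationship type -> {bin: count}.
    """
    bins = set()
    for per_bin in weighted_counts.values():
        bins.update(per_bin)
    probs = {rel: {} for rel in weighted_counts}
    for b in bins:
        total = sum(per_bin.get(b, 0.0) for per_bin in weighted_counts.values())
        if total == 0:
            continue  # no simulated pairs in this bin, so no prediction
        for rel, per_bin in weighted_counts.items():
            probs[rel][b] = per_bin.get(b, 0.0) / total
    return probs


# Toy weighted counts for two relationship types over two bins.
toy = {
    "half-sibling": {1700: 30.0, 1701: 10.0},
    "avuncular": {1700: 10.0, 1701: 30.0},
}
probs = relative_probabilities(toy)
```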

      3. Results and discussion

      Un-smoothed probability curves (Fig. 1a) would give varying predictions for small changes in cM inputs. Fig. 1b shows smoothed but unweighted probability curves, which appear to be monotonic over the appropriate intervals. The population weighted curves in Fig. 1c are only subtly different from the unweighted probability curves. More distant relationships are shifted to the right and slightly lower, but close relationships appear unchanged.
      Fig. 1. Un-smoothed probability curves with no population weights (a), smoothed curves with no population weights (b), and curves with smoothing followed by population weights (c) for various relationship types at AncestryDNA. All IBD sharing values in this figure use the half-identical region (HIR) metric, as do AncestryDNA reports. Each probability is relative to all other relationship types possible for a given cM value. P/Ch., parent/child; Full-Sib., full-sibling; GP/GCh., grandparent/grandchild; Half-Sib., half-sibling; Avunc., avuncular (aunt/uncle/niece/nephew); G-, great.
      The resulting tool can be found at https://dna-sci.com/tools/orogen-wtd/, where users can obtain probabilities for relationships by entering values of 8 cMs or higher as inputs. Rather than displaying the probabilities by groups of relationship types with the same average shared DNA, Orogen shows them at finer resolution, separating relationships whose curves differ significantly in Fig. 1.
      Orogen provides correct relationship predictions for average values, but excels at close relationships with values farther from the mean. A few examples will illustrate this. Two paternal half-siblings share 2055 cMs at AncestryDNA (data collected by the author). Another tool, found at DNA Painter [DNA Painter, 2022], takes its probabilities from the 2016 AncestryDNA White Paper and combines all of the Group 2 probabilities. The values are similar between the two tools. DNA Painter shows a 92% probability for Group 2 while the Orogen probability is 95.7%. The benefit of Orogen in this case is that it shows that paternal relationships are more likely and that maternal and avuncular relationships are less likely.
      As another example, a woman and her known paternal grandmother share 38.2% DNA at 23andMe (data collected by the author) with no pedigree collapse, since her father has also been tested. The relationship predictor at DNA Painter converts this percentage to 2842 cMs and gives a 100% chance of full-siblings, leaving no possibility for paternal grandmother. Orogen predicts the correct relationship with a ∼20% probability. Although an ∼80% probability is given for full-siblings, it is not possible since no completely identical regions were shared.
      Orogen also accurately predicts full-sibling relationships at 23andMe. A typical value for full-siblings at 23andMe is 3500 cMs, just under the theoretical average. This is given a ∼15% probability at Orogen and a 0% probability at DNA Painter.
      Future work will concentrate on a population model that includes pedigree collapse.

      Conflict of interest statement

      The author has no conflict of interest to declare.

      References

      1. Ball Catherine A., et al. Ancestry DNA matching white paper. Ancestry Com. 2016: 1-46.
      2. Caballero Madison, et al. Crossover interference and sex-specific genetic maps shape identical by descent sharing in close relatives. PLoS Genet. 2019; 15(12): e1007979.
      3. Bhérer Claude, Campbell Christopher L., Auton Adam. Refined genetic maps reveal sexual dimorphism in human meiotic recombination at multiple scales. Nat. Commun. 2017; 8(1): 1-9.
      4. Campbell Christopher L., et al. Escape from crossover interference increases with maternal age. Nat. Commun. 2015; 6(1): 1-7.
      5. Henn Brenna M., et al. Cryptic distant relatives are common in both isolated and cosmopolitan genetic samples. PLoS One. 2012; 7(4): e34267.
      6. Autosomal DNA Match Thresholds, International Society of Genetic Genealogy Wiki, 2022. 〈https://isogg.org/wiki/Autosomal_DNA_match_thresholds〉.
      7. Shared cM Project 4.0 Tool v4 with Relationship Probabilities, DNA Painter, 2022. 〈dnapainter.com/tools/sharedcmv4〉.