Originally Published: Vol 9, Num 1 (Fall 2021)
Reference Number: 91.007
Note: The full article is too long to reproduce here, but the Summary and Introduction are provided below. For the full article, please use the View PDF link.
The calculation of a “Time to Most Recent Common Ancestor” (TMRCA) using relevant SNP data can be so simple that many genealogists are tempted to use this tool and to draw inaccurate, imprecise and unwarranted conclusions, however unintentionally. Conversely, the calculation of the Confidence Intervals (CIs) that should accompany such calculations is complex and rarely attempted. This paper does not promote some new panacea. Instead, it draws in part on a novel analysis of 17 samples of SNP counts to help genealogists understand why the popular use of SNP-based TMRCAs without CIs is misguided, why in practice these CIs are difficult to calculate, and how curious genealogists can readily estimate indicative CIs for their own data. It also explains why a growing number of genealogists are recognising that the inherent uncertainties which CIs quantify are so great that SNP-based TMRCAs are usually of much less practical use than is often assumed. The development of practical models that include inputs of STR and historical data can reduce these uncertainties, but the temptation to use and misinterpret simplistic SNP-based TMRCA calculations is not going to disappear.
The use of DNA data to calculate TMRCAs is a long-standing objective of the genetic genealogy community. The advent of Next Generation Sequencing (NGS) Y-DNA tests, such as FamilyTreeDNA’s BigY test, and the resulting SNP haplotrees appears to offer a significant step towards this goal: many see SNP data as being more reliable than STR data, the calculation of SNP-based TMRCAs can be very simple, published SNP mutation rates are accompanied by a 95% confidence measure which, if not fully understood, seems to add comfort, and above all the resulting TMRCAs appear to be “in the right ball park”. Though adopting very different approaches, both Dave Vance’s SAPP model[1] and Iain McDonald’s recent paper[2] appear to build on these lay perceptions and to supplement the basic SNP-based model with STR and historical inputs. However, the application of such models is too demanding for many administrators (“admins”) of surname DNA projects, and, as noted in his Conclusion, McDonald has not addressed some of the associated practical problems.
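To make concrete how simple the basic calculation is, and how wide the uncertainty already becomes once only the random nature of SNP counts is considered, the following is a minimal Python sketch. It is not the method of this paper, of SAPP or of McDonald. It assumes that the SNP count on a single lineage follows a Poisson distribution, and it uses a placeholder value, `years_per_snp = 80.0`, that is purely illustrative and not taken from the article. The function names are hypothetical. Because it ignores the uncertainty in the mutation rate itself and every other source of error, the interval it returns should be read as an optimistic lower bound on the real uncertainty.

```python
import math


def poisson_cdf(k, lam):
    """Return P(X <= k) for X ~ Poisson(lam)."""
    if k < 0:
        return 0.0
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return min(total, 1.0)


def _bisect_decreasing(f, lo, hi, iters=200):
    """Find the root of a decreasing function f on [lo, hi] by bisection."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def poisson_exact_ci(k, level=0.95):
    """Exact (Garwood) confidence interval for a Poisson mean, given count k."""
    alpha = 1.0 - level
    upper_limit = k + 10.0 * math.sqrt(k + 1.0) + 10.0  # generous search bound
    if k == 0:
        lower = 0.0
    else:
        # Lower bound: the mean at which P(X >= k) equals alpha / 2.
        lower = _bisect_decreasing(
            lambda lam: poisson_cdf(k - 1, lam) - (1.0 - alpha / 2.0),
            0.0, upper_limit)
    # Upper bound: the mean at which P(X <= k) equals alpha / 2.
    upper = _bisect_decreasing(
        lambda lam: poisson_cdf(k, lam) - alpha / 2.0,
        0.0, upper_limit)
    return lower, upper


def tmrca_with_ci(snp_count, years_per_snp=80.0, level=0.95):
    """Naive TMRCA in years with a counting-only confidence interval.

    years_per_snp is an ILLUSTRATIVE placeholder, not a published rate.
    """
    low, high = poisson_exact_ci(snp_count, level)
    return (snp_count * years_per_snp,
            low * years_per_snp,
            high * years_per_snp)


if __name__ == "__main__":
    point, low, high = tmrca_with_ci(10)
    print(f"TMRCA about {point:.0f} years, 95% CI {low:.0f} to {high:.0f}")
```

Under these assumptions, a lineage with 10 SNPs gives a point estimate of 800 years, but the interval runs from under 400 to well over 1,400 years before any uncertainty in the mutation rate has been added.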
This paper addresses challenges presently facing project admins with limited mathematical skills and/or small samples when they attempt to estimate TMRCAs from SNP inputs derived from NGS tests such as BigY700. The following challenges are addressed in turn: