John F. Chandler
Abstract
Relative perlocus mutation rates
for YDNA microsatellites, and also for mitochondrial
DNA singlenucleotide polymorphisms, can be estimated directly from diverse
collections of haplotypes without segregating population components. The calibrated average mutation rates for the
FTDNA panels of 12, 25, and 37 loci are 0.00187±0.00028, 0.00278±0.00042, and
0.00492±0.00074, respectively, from a collection of haplotypes from YSearch.
The individual perlocus rates are given here.
Address for correspondence: john.chandler (at) alum.mit.edu
Received:
Introduction
The task of estimating mutation rates for human DNA is
troublesome, whether for individual loci or for averages over panels of
loci. The mutations are so infrequent,
and the generation spans so long, that direct observation of a statistically
useful sample is expensive and timeconsuming.
Nonetheless, these rates are of great interest in genetic genealogy and
related fields. Techniques for circumventing
this basic difficulty to gather the necessary data include: “piggybacking” on
paternity tests (Gusmão et al. 2005), testing
judiciously chosen subjects with interconnected, deeprooted pedigrees (Heyer et al. 1997), and sifting large, heterogeneous
collections of measured haplotypes using statistical models to extract the mutation
rates from the other information present (Zhivotovsky
et al. 2004; Hutchison et al. 2004). The
latter technique needs calibration, but can be applied to data collected for
other purposes. Indeed, the very same
DNA testing that demands knowledge of the mutation rates for interpreting the
results can lead to the determination of those rates. Of particular interest is the technique
introduced by Hutchison et al (2004) of sorting pairs of haplotypes by
closeness and extracting information from the match profiles. This technique requires a cumbersome extra
step of identifying and isolating a subpopulation and depends on the dangerous
assumption that the chosen subpopulation is entirely characterized by a single
time depth, but there is a simpler method that avoids these problems. In this paper, I develop expressions for the
“mutation model curve” (MMC) of Hutchison et al. (2004) and outline a procedure
for using the highmatch end of the MMC for extracting mutation rates. Throughout this paper, the mutation rates are
assumed to be independent of time, environment, and haplotype. Since these assumptions may be false,
especially the last (Dupuy et al. 2004), the results
must be taken in the context of the data considered. Of these factors, only haplotype is
potentially accessible to a more detailed analysis of currently existing data.
Mathematical Model
First, we
must define a function _{} as the probability
that all loci, except locus j, match
between two haplotypes separated by a total of g generations. Each
haplotype consists of N loci, each
locus j with a different mutation
rate given by_{}. If we assume the mutation rates are small, we may
approximate mutation probabilities by an exponential function. In terms of the infinite alleles model, we
would write the probability of a single locus j remaining at the ancestral allele after g generations as_{}, while the probability of a mutation at that locus would be _{}. Then _{} is just the product of
the N individual probabilities at the
N loci:
_{} 1
where
the second line of Equation 1 comes from multiplying and dividing the first by _{}. Note that g, the separation between two
haplotypes, can be viewed either as the number of generations from an ancestor
being compared to a particular descendant or as the “roundtrip” number of
generations from one person back to a shared ancestor and then forward to the
contemporary being compared. Since most
DNA testing is done on living individuals, the latter interpretation of g is more commonly applicable, but the
former is equally valid. In the limit of
small _{} the choice of mutation
model is unimportant, and, to leading order in the small parameters, we would
have for either the infinite alleles model or the
stepwise mutation model:
_{} 2
where Q(g)
is the probability of a match on all loci in whichever model. Similarly, we may write the probability that
two haplotypes match at all loci except j,
k, and l as
_{} 3
and so on
for any number of mismatching loci. Next
we need an expression for the probability that a mutation has occurred at some
number of loci, b, while the
remainder have remained at the ancestral alleles. As an example, consider the case where
exactly three loci (any three) out of a total of five loci, have mutated. The probability of this occurring is just the
sum of the probabilities of all possible triplets of mutating loci (each given
by Equation 3):
_{} 4
where
we define _{} as the sum of all
products of b distinct elements of
the set _{}. For the general case, the probability that b loci out of N have mutated is given by
_{} 5
Now
consider the probability that a particular
locus has mutated along with any two others, i.e., the probability of a mismatch at the specified locus when all
but three of the loci match. This
probability is similar to that shown in Equation 4, except that the sum within
the brackets includes only those terms that contain, say, µ_{1}. That is
(continuing to use the example of b=3
and N=5), the probability is given by
_{} 6
where
we define a function _{}as the sum of all products of b1 distinct elements of the set _{}. The limiting case for both this new function and the
original unsubscripted C is defined as C(0) = 1. Thus, the probability of a
mismatch at locus j when all but b of the loci match, given a separation
of g generations, can be written as
_{} 7
where b must be greater than 0, since locus j is a mismatch by definition. Clearly, the probability must vanish when b is 0.
In general, we must deal with a population of haplotype pairs with a
range of separations, and we thus must calculate _{}, the overall probability of a mismatch at locus j for the whole population when all but b of the loci match. We do so by
weighting the probability in Equation 7 by the fraction _{} of the population of
pairs having a separation of g
generations and summing over g:
_{} 8
Similarly,
we may write D(b), the total probability of b
mismatches, in the same population as
_{} 9
The salient
feature of Equations 8 and 9 is the fact that they share a common factor
encapsulating the unknown population statistics _{}, and thus these equations differ only in terms that depend
just on the mutation rates. It is
therefore useful to define a mismatch profile function
_{} 10
This
function is the conditional probability of a mismatch on locus j, given b mismatches in all. This is directly related to the MMC as defined
by Hutchison et al. (2004); their _{} (probability of a
match at locus i
given n matches) is just _{}. By construction, _{} and _{}, since no locus can mismatch if the number of mismatches is
0, and every locus must mismatch if the number is N. Of course, the MMC will
depart increasingly from Equation 10 as b
grows beyond the linear limits assumed in Equation 2, since the population
structure will contribute differently to the nonlinear terms omitted from
Equations 8 and 9.
Equation
10 can be inverted to give an expression for the mutation rate in terms of the
(observed) mismatch profiles and the C functions:
_{} 11
Of course,
Equation 11 is recursive, in that the C
values depend on the mutation rates, but it can serve as the basis for an
iterative procedure for calculating the rates.
The definition of C involves
combinations of many terms—so many that the direct computation becomes
prohibitive at relatively modest values of b
and N. It can be shown that C obeys the recursion relation
_{} 12
where
_{} 13
In the
trivial case where the individual mutation rates are all the same, Equation 12
simplifies to
_{} 14
where
_{} is the uniform
mutation rate, and the parentheses indicate binomial coefficients, where the
coefficient “N take b” stands for N!/(b![Nb]!).
Also, Equation 10 simplifies to
_{} 15
Thus, in
the absence of population structure, the MMC for uniform mutation rates would
be straight lines from the (0, 0) to (N,
1), exactly as one would expect. This simple linear form suggests a practical
iterative procedure for determining the relative mutation rates from the
mismatch profiles:
1)
Choose a suitable value b as large as possible, but small enough that the mismatch
functions are small compared to 1.
2)
From the data, find the values _{} for all loci j.
3)
Initialize the mutation rate estimates to_{}, where r is a normalization
factor chosen to give values in a convenient range. (Since we have cancelled out the population
statistics, it is clear that we can obtain only relative rates from this
analysis, and so the normalization is arbitrary.)
4)
Use the mutation rate estimates to evaluate Equations 13,
12, and 11 to obtain a new set of estimated rates.
5)
Repeat Step 4 until convergence.
6)
Repeat Steps 25 for all values of b from 1 to the value chosen in Step 1 and take a weighted average
of the mutation rate estimates. (See below for a discussion of error analysis.)
Error
analysis
Analyzing
data in pairs instead of singly introduces correlations between pairs with
shared observations. Thus, a proper
error analysis of this procedure would be complicated by the need to deal with
all pairs of observations. Indeed,
correlations could even bias the
results. However, since the mismatch
profiles are computed on restricted subsets of the pairs, the pairtopair
correlation within each bin is greatly reduced.
Thus, a simple error analysis treating each pair as an independent
observable should suffice, with just one modification. Normally, the
leastsquares estimate of a probability p
from statistics of a population of N
cases carries an uncertainty of {p(1p)/N}^{0.5},
but the relevant N here must be the
lesser of the number of pairs found in a given bbin and the total number of haplotypes, since the latter is the
number of independent data. The uncertainties for the mismatch probability are
scaled to mutation rate uncertainties as in Equation 11, and these in turn are
used as the relative weights for the weighted average described in Step 6 above
(in the usual inversesquare sense).
Calibration
The final
component of this analysis is the calibration for converting the relative
mutation rates to absolute rates.
Pedigreebased rate determinations offer the advantage of “leverage,”
whereby each test subject carries an accumulation of mutations over many
generations (though not so many that multiple mutations on any one locus would
be an issue). However, the collection of
pedigree data from volunteers who already know the results of the DNA tests
leads to serious risks of selection bias in the data. At present, the only reliable large
collection of mutation statistics for YDNA microsatellites
is that collected from fatherson pair studies by Gusmão
et al. (2005). To make use of such
statistics, we must simultaneously fit the observables _{} (number of mutations
for locus i
in the calibration data) and _{} (the “observed”
relative mutation rate) with a simple model by weightedleastsquares analysis:
_{} 16
where _{} is the absolute
mutation rate for locus i,
c is the calibration factor between
relative and absolute rates, and _{} is the number of
meioses for locus i
in the calibration data. A third
relation using _{}and c′ can be
used to include a second, independent set of rate estimates in the analysis.
Application
to YSTRs
I have
applied this procedure to a collection of 8430 Y haplotypes downloaded from Ysearch in July of 2006 as a demonstration. In the parlance of Ysearch,
each haplotype consists of 37 loci, but this set includes five multicopy loci,
such that the number of independent loci is only 30. Each copy of a multicopy locus in one
haplotype must match the corresponding copy in the other haplotype if the locus
is to be considered a match. This
grouping of the loci avoids the necessity of guessing which copy corresponds to
which in the two haplotypes being compared.
About half of the collection belongs to haplogroup R1b as specified by
the contributors, and the other half is an assortment of other haplogroups.
Most of
the information carried by this collection is filtered out in Step 1 of the
analysis, since only nearlymatching pairs are considered. In Figure 1 and in each panel
of Figure 2, there is
a vertical line marking the lower limit of matching used in Steps 26, and it
is apparent from Figure 1
that only a tiny fraction of the pairs is included. The tall peak on the left of Figure 1 represents the
“typical” interhaplogroup comparison and is thus characteristic of the
particular mix of haplogroups included in the data collection. The peak for this collection is at 7/30
matching and thus corresponds to a very old population. The slightly smaller peak on the right
represents the “typical” withinhaplogroup comparison, mainly the comparison
within R1b, which dominates this collection.
It should be possible to gather collections whose histograms would show
more than two peaks by focusing on distinctive clades within haplogroups. However, as noted above, the information
contained in the peaks of Figure
1 is ignored in the present analysis.
Examination
of Figure 2 shows
that the “ideal” straightline MMC indicated by Equation 15 is seldom
realized. Indeed, it can be shown that,
even in the absence of population structure (such as is strikingly revealed in
Figure 1), the MMC based on Equation 10 is generally curved when the
locusspecific mutation rates are not the same.
The curvature is positive for loci with belowaverage mutation rates and
negative for loci with aboveaverage rates.
Attempting to fit straight lines to the MMCs
without taking this curvature into account would tend to compress the apparent
dynamic range of mutation rates, thus reducing the estimates of high rates and
raising the estimates of low rates.
The
results of the analysis for this collection of Y haplotypes, combined with an independent
set of 6955 20locus Ysearch haplotypes, are
presented in Table 1. These
results include the calibration of Equation 16, and the uncertainties shown
here include the contribution of the calibration and the level of agreement
between the 20 and 30locus sets.
However, the error analysis here makes no provision for the
uncertainties due to the assumption of constant rates.
Table 1.
Calibration data and final results for absolute mutation rates (per copy for
multicopy loci, indicated by asterisks). Weighted fit to 6955 20locus and
8340 30locus haplotypes.
Locus 
Calibration data 
Best
fit 
Std.
dev. 
Locus 
Calibration data 
Best
fit 
Std.
dev. 
DYS393 
0.00075 
0.00076 
0.00014 
DYS447 

0.00264 
0.00041 
DYS390 
0.00227 
0.00311 
0.00048 
DYS437 
0.00222 
0.00099 
0.00017 
DYS19 
0.00168 
0.00151 
0.00025 
DYS448 

0.00135 
0.00020 
DYS391 
0.00351 
0.00265 
0.00041 
DYS449 

0.00838 
0.00128 
DYS385
* 
0.00224 
0.00226 
0.00035 
DYS464 * 

0.00566 
0.00087 
DYS426 

0.00009 
0.00002 
DYS460 
0.00450 
0.00402 
0.00069 
DYS388 
0.00057 
0.00022 
0.00004 
YGATAH4 
0.00290 
0.00208 
0.00033 
DYS439 
0.00530 
0.00477 
0.00073 
YCAII * 
0.00000 
0.00123 
0.00019 
DYS389i 
0.00188 
0.00186 
0.00028 
DYS456 

0.00735 
0.00115 
DYS392 
0.00061 
0.00052 
0.00010 
DYS607 

0.00411 
0.00066 
DYS389ii 
0.00226 
0.00242 
0.00041 
DYS576 

0.01022 
0.00167 
DYS458 

0.00814 
0.00124 
DYS570 

0.00790 
0.00138 
DYS459
* 

0.00132 
0.00021 
CDY * 

0.03531 
0.00549 
DYS455 

0.00016 
0.00004 
DYS442 

0.00324 
0.00051 
DYS454 

0.00016 
0.00003 
DYS438 
0.00044 
0.00055 
0.00012 
The
average mutation rate for these loci, considered as 37 loci in Ysearch terms, is 0.00492±0.00074 mutation per locus per
generation. In contrast, the average
rates for the first 12 and the first 25 are 0.00187±0.00028 and
0.00278±0.00042, respectively. The
extreme cases, CDYa and CDYb,
are thus about seven times as fast as the average for all 37, while DYS426 is
about 1/50 as fast. The uncertainties
for the average rates are the statistical standard deviations, scaled such that
χ^{2} per degree of freedom is unity. They include the effect of the correlations
introduced by scaling each whole set of relative mutation rates with a common
calibration factor and by multiplecounting the multicopy loci. The uncertainties are thus about triple the
corresponding normalized rootsumsquares of the individual perlocus standard
deviations. Not surprisingly, the
12locus average has the smallest uncertainty: 11 of the 12 are “anchored” by
calibration data. Also, the perlocus
standard deviations have a component which varies as the squareroot of the
mutation rate, and thus the composite uncertainties should also be smaller for
smaller average rates.
Another
unsurprising result is that the fits agree rather well with the calibration
data – the exceptions being mainly those loci with the fewest fatherson pairs,
such as DYS388 and DYS437. These fits,
however, are at odds with previous analyses using the method of Hutchison et
al. (2004), in that the range of rates is now wider because of the proper
accounting for MMC curvature, as noted above.
Comparison with analyses of
volunteerdriven,
pedigreebased data is unprofitable because of the unknown biases associated
with the latter.
Application to mtDNA
The same
analysis can be applied to mitochondrial DNA, but there are too many base pairs
in the standard HVR sequences (1143 in the HVR1+HVR2 test offered by Family
Tree DNA) to treat them all as separate loci here. Dealing with compound loci consisting of 20
base pairs each will keep the number of loci down to a manageable 58. I downloaded 2717 such haplotypes from
Mitosearch in April of 2006 and treated them in the same way as the Y DNA
haplotypes above, except that I have no corresponding calibration data from
direct observation. The results are therefore
expressed with an arbitrary scale factor to give an average rate of 0.0001
mutation per generation per (compound) locus.
It is worth noting that the histogram of matches, shown in Figure 3, is very different
from that in Figure 1. There is only one peak in the distribution,
and the “tail” of that peak is still significantly above zero all the way to
perfect matches. In other words, perfect
matches are not at all uncommon, and there is no obvious separation of
population components.
The
results of the mtDNA analysis are shown in Table 2, with each compound
locus labeled by the position of its last base pair. Within the resolution of this analysis, four
of the loci show no mutability at all, and seven others are at or below 1% of
the average rate. (Of course, two loci are shorter than 20 base pairs, being 9
and 14, respectively, and both of these figure in the extreme low rate
lists.) At the high end, the locus
301320 is about 14 times as fast as the average.
Table 2
Normalized mutation rates for compound mtDNA loci
Last
base 
Best Fit 
Std. Dev. 
Last
base 
Best Fit 
Std. Dev. 
16020 
0.00000000 
0.00000000 
20 
0.00000138 
0.00000026 
16040 
0.00000005 
0.00000002 
40 
0.00000078 
0.00000011 
16060 
0.00001345 
0.00000133 
60 
0.00000637 
0.00000065 
16080 
0.00002193 
0.00000356 
80 
0.00025204 
0.00008546 
16100 
0.00008002 
0.00000897 
100 
0.00002812 
0.00000395 
16120 
0.00002868 
0.00000475 
120 
0.00001461 
0.00000164 
16140 
0.00022024 
0.00001290 
140 
0.00000426 
0.00000049 
16160 
0.00004133 
0.00000467 
160 
0.00044132 
0.00001566 
16180 
0.00011192 
0.00000279 
180 
0.00000350 
0.00000037 
16200 
0.00025233 
0.00001461 
200 
0.00015501 
0.00001005 
16220 
0.00007979 
0.00000200 
220 
0.00002686 
0.00000293 
16240 
0.00020343 
0.00001573 
240 
0.00003063 
0.00000322 
16260 
0.00007544 
0.00000295 
260 
0.00001839 
0.00000087 
16280 
0.00015686 
0.00000603 
280 
0.00031359 
0.00006351 
16300 
0.00035193 
0.00003702 
300 
0.00001306 
0.00000066 
16320 
0.00036923 
0.00004056 
320 
0.00143273 
0.00011906 
16340 
0.00002024 
0.00000236 
340 
0.00001092 
0.00000055 
16360 
0.00006397 
0.00000324 
360 
0.00000021 
0.00000002 
16380 
0.00005363 
0.00000566 
380 
0.00000137 
0.00000050 
16400 
0.00003918 
0.00000701 
400 
0.00000210 
0.00000025 
16420 
0.00000000 
0.00000000 
420 
0.00000496 
0.00000103 
16440 
0.00000100 
0.00000007 
440 
0.00000006 
0.00000003 
16460 
0.00000000 
0.00000000 
460 
0.00004121 
0.00000543 
16480 
0.00000540 
0.00000086 
480 
0.00007197 
0.00001360 
16500 
0.00001821 
0.00000436 
500 
0.00005288 
0.00001276 
16520 
0.00048480 
0.00005062 
520 
0.00001667 
0.00000081 
16540 
0.00001170 
0.00000168 
540 
0.00014729 
0.00000356 
16560 
0.00000002 
0.00000001 
560 
0.00000237 
0.00000030 
16569 
0.00000000 
0.00000000 
574 
0.00000058 
0.00000009 
Conclusion
In the
Spring of 2006, Family Tree DNA introduced an additional panel consisting of 30
Y DNA microsatellite loci beyond the 37 analyzed
here. At the time of this writing, there
are hundreds, though not thousands, of completed tests delivered to customers
and potentially available for analysis.
In the near future, it should be possible to apply the method described
here to these extended haplotypes and estimate the rates of all tested loci.
Web Resources
www.ysearch.org YSTR
database
www.mitosearch.org mtDNA
database
References
Dupuy, BM, et al (2004) Ychromosomal microsatellite
mutation rates: differences in mutation rate between and within loci. Hum Mutat 23:117124.
Gusmão, P, et al (2005) Mutation rates
at Y chromosome specific microsatellites. Hum Mutat 26:520528.
Heyer, E, et al (1997) Estimating Y
chromosome specific microsatellite mutation
frequencies using deep rooting pedigrees. Hum Mol Genet 6:799–803.
Hutchison,
LAD, et al (2004) Direct determination of mutation characteristics of Y
chromosome STR loci. Poster presented at the American Society of Human Genetics
2004 Annual Meeting, October 2004,