Dating Y-DNA Haplotypes on a Phylogenetic Tree: Tying the
Genealogy of Pedigrees and Surname Clusters into Genetic
William E. Howard III and Frederic R. Schwab
An RCC matrix (Howard 2009a), resulting from a new correlation approach to analyze
Y-DNA haplotypes, is used in conjunction with a standard Mathematica application
program to produce a dated phylogenetic tree. The program displays the evolutionary
relationships among all haplotypes in the matrix; it groups closely related surnames into
family clusters that correlate well with genealogical pedigrees. The time scale assigned to
the tree is monotonic, linear, and dates the evolutionary relationships of Y-DNA testees
that may go back tens of thousands of years. This study is arguably the first to investigate
the time relationships between surname Y-DNA haplotypes, pedigree- and RCC matrix-derived
surname clusters and their associated phylogenetic tree. It offers a straightforward
methodology and a uniform time scale that can also be used to estimate the evolutionary
relationships among Y-DNA haplogroups.
Address for correspondence: William E. Howard, [email protected]
Received: February 6, 2011; accepted: August, 2011.
A new approach to analyze Y-DNA haplotypes has been introduced (Howard 2009a,
Howard 2009b). An RCC time scale, calibrated with over 100 pedigrees, has been
developed that can be applied over tens of thousands of years to investigate the
evolutionary relationships that tie genealogy and genetics together by analyzing surname
clusters and haplogroups.
More recently these correlation techniques and genealogical pedigrees have provided
additional insight into the history and evolution of the Gordon surname (Gordon and
Howard 2012, hereafter called the Gordon paper). That paper presented relationships
among pedigrees, surname cluster and subcluster membership, and possibly geographic
location. Available pedigrees are associated with membership in subclusters and clusters.
The ISOGG time estimates, the RCC time scale, and the Y-DNA evidence can be used
to suggest places where surnames originated. The times derived for early Gordon
surname migration are consistent with the history of those places derived from
archaeological excavations, genetics and anthropologic studies. A comparison of the
ISOGG dates and the RCC time scale shows good agreement and no inconsistency
between the RCC- and ISOGG-derived estimates. Moreover, there is no evidence of nonlinearity,
and the time scale appears to be valid for use over time intervals out to over
50,000 years, making it potentially valuable when genetic considerations are important to
studies of mitochondrial DNA, migration and linguistic patterns, geology, anthropology,
paleontology and archeology.
The traditional process of analyzing family groups and clusters from Y-DNA haplotypes
is challenging. If the number of testees is small, the results are statistically ill-determined
and therefore ambiguous; if the number is large, the analysis becomes arduous. While the
description of the evolution of surname clusters appears to be correct in the above three
papers, the process lacks a quick ordering of the cluster groupings as well as a good
presentation of the time scale of the evolution. To provide that confirmation, we looked
into the availability of a computer program that would have one or more of the following
1. It would group entries into family surname clusters and subclusters.
2. It would derive evolutionary relationships among the clusters and subclusters.
3. It would derive a phylogenetic tree from the evolutionary relationships.
4. It would attach a time scale to the phylogenetic tree.
5. It would be unambiguous in its presentation.
6. It would execute quickly.
7. It would be available at reasonable cost.
8. It would be easy to use.
One of us (FRS) uses the computational software program, Mathematica (Wolfram 2010)
in support of research related to radio astronomy and suggested that its built-in package
for hierarchical clustering and dendrogram generation would address some of the
problems we wished to solve. This paper describes the use of this package, including
preparation of the input data, and the success we have had with it in confirming and
extending previous results.
THE MATHEMATICA PROGRAM – DATA PREPARATION
The result of a Y-DNA test is expressed as a haplotype or marker string, which, in this
study, is 37 markers long. When the haplotypes of n testees are selected for study, we use
the correlation function (e.g., Excel for Mac 2004) to derive an n by n matrix of
correlation coefficients that is then converted to an RCC matrix (Howard 2009a, Howard
2009b). That n by n matrix is inserted into Mathematica which produces a phylogenetic
APPLICATION OF THE MATHEMATICA PROGRAM:
The so-called agglomerative hierarchical clustering methods, as implemented in
Mathematica, begin with a square n by n distance, or dissimilarity, matrix, D, the i-j
element of which contains a measure of the distance d(i,j) between objects i and j (i.e., in
our case, the ith and jth set of marker data for the n individuals) according to some
appropriately defined metric. The Euclidean distance might be chosen, or the correlation
distance, d(i,j)=1-ρ(i,j) where ρ is the Pearson correlation coefficient. In our case we
choose the RCC matrix, with elements 104(1/ρ(i,j)-1), as defined in the earlier references
(Howard 2009a, Howard 2009b), and for which we have an associated time scale.
Next, for splitting the data into clusters, a measure of intercluster distance needs to be
defined; this is specified via the so-called linkage option. There are a number of standard
choices for the linkage option, seven of which are built into the Mathematica clustering
package; it can also be user-defined. We have selected Mathematica's linkage-
>``average'' method, which coincides with what is known in the literature as the
unweighted pair group method using arithmetic averages (UPGMA). See Press et al.
2007. The hierarchical clustering proceeds in a bottom-up fashion, beginning with a list
consisting of n singleton clusters. Then the following steps are repeated n-2 times:
(1) find the two nearest clusters, according to the intercluster distance measure
associated with the chosen linkage option;
(2) create a new cluster that agglomerates the two;
(3) update the active cluster list accordingly; and
(4) go back to (1).
One ends up with a phylogenetic tree, the structure of which can be displayed in a so-called
A good textbook reference, in addition to Press et al. (2007), is Everitt (2001). And
full algorithmic details, including analyses of computational complexity, are given in
Day & Edelsbrunner (1984).
We first used the full 187 by 187 RCC matrix that contained all 37-marker Y-DNA
Gordon surname test results, grouped by haplotype, as input to Mathematica. The initial
output, using the program’s default setting, presented a well-formed phylogenetic tree.
The program grouped the testees into exactly the same clusters derived in the Gordon
paper. The output contained the Kit Numbers of each testee, many of which are not
members of the major clusters. Some were in minor clusters; some in no clusters at all.
The program output using the full matrix did not initially contain a time scale, but it
clearly fulfilled the first three attributes we were seeking. To investigate the time axis of
the tree, we compared the average value of each Gordon cluster with its height along the
ordinate. We found that the scale was linear (R2 = 0.9997), so values of RCC could be
safely assigned to the ordinate, presenting a time axis in which 10 RCC ~ 433 years
A comparison of each of the seven trees with the evolutionary diagrams in Figures 4A
and B of the Gordon paper showed that the average linkage option gave the best
agreement. This result was not unexpected since averages were used in the Gordon paper
to derive the intercluster relationships. To investigate how well the time axis of the tree in
the average linkage option fit the data in the RCC matrix, we measured the lengths along
the RCC axis of the intercluster junctions and compared them with the average values of
each Gordon intercluster at their RCC junction points found in the Gordon paper. We
found that the relation was:
RCC (tree) = 0.962 x (RCC, matrix), with R2 = 0.826.
This agreement between the RCC values for the tree and the matrix RCC values for the
interclusters gave us confidence that the average linkage was the option to be used. Thus
we conclude that the RCC matrix, using Mathematica with the average linkage option,
will yield a phylogenetic tree that will show the evolution of the modern Gordon surname
at least ten times further back in time than pedigrees and cluster designations can do.
Moreover, this application of the Mathematica cluster analysis program will show the
evolution of any surname matrix for which a sufficient number of testees is available.
We next generated a tree using the average link option only for the 119 specific clusters
described in the Gordon paper. Figure 1 shows the results.
Figure 1: The Phylogenetic Tree Produced by Using the 119 by 119 RCC Matrix with
Mathematica’s Average Linkage Option for all Gordon Y-DNA 37-marker Haplotypes
that formed clusters in the Gordon paper4. The ordinate shows for each testee his Kit
Number, Gordon Cluster assignment, and Haplotype. The abscissa shows the RCC time
scale (10 RCC ~ 433 years).
A subset of Figure 1 can be used to present the subclusters of the Gordon Cluster A. This
result is presented in Figure 2. It is particularly interesting for several reasons:
1. It contains evolutionary information over time intervals of genealogical interest.
2. It contains a large number of testees within the single Gordon Cluster A.
3. It contains subclusters that show evolutionary branching, and,
4. It contains the best pedigree information outlined in the Gordon paper.
Figure 2: A Detail of the Gordon Cluster A in Figure 1. The abscissa shows the RCC
time scale (10 RCC ~ 433 years). The ordinate shows for each testee his Kit Number
identification, his Gordon Cluster and Subcluster assignment and his haplotype.
At the bottom right of Figure 2 we find that Gordon subcluster Aa actually consists of
two sub-sub clusters (Kit numbers 45694 to 36388 and Kit numbers 129211 to 87584).
The existence of the Gordon A subclusters was not recognized originally on the Gordon
FTDNA web site and the existence of the two sub-sub clusters was not originally
recognized in the RCC matrix (Gordon and Howard 2011). The remarkable ability to
recognize and resolve surname clusters with such high resolution shows the power of the
hierarchical clustering methods as implemented in Mathematica.
To investigate how well the time axis of the Gordon Cluster A tree in the average linkage
option fit the data in the RCC matrix, we measured the lengths along the RCC axis of the
intersubcluster junctions and compared them with the average values of each Gordon
intersubcluster using values of the RCC junction points found in the Gordon paper. We
found that the relation was:
RCC (tree)= 1.0213 x RCC (matrix), with R2 = 0.48.
Again, we find that the time relationship is linear, but the scatter is larger than for the full
matrix due to mutation uncertainties over this short time interval of about 1100 years.
For the pedigree relationships in the subclusters Gordon Cluster A, see the Gordon paper.
The 68 testees who were not designated as cluster members can still be presented in a
phylogenetic tree. Although not in clusters, their positions on the tree with other testees
gives valuable insight into points in the tree where they share common ancestry with
cluster members and all other testees. With few exceptions, their TMRCAs lie at RCC >
20-25, as expected. Figure 3 presents the phylogenetic tree for the full 187 x 187 RCC
Figure 3: The Phylogenetic Tree Using the full 187 by 187 RCC Matrix and
Mathematica’s Average Linkage Option for all Gordon Y-DNA 37-marker Haplotypes4.
The ordinate shows for each testee his Kit Number, Gordon Cluster assignment, and
Haplotype. The abscissa shows the RCC time scale (100 RCC ~ 433 years).
DISCUSSION OF RESULTS:
This investigation is arguably the first to study the time relationships among surname YDNA
haplotypes, pedigree and RCC matrix-derived surname clusters shown in a dated
phylogenetic tree. Our conclusions are based only on the 37-marker haplotype data that
met the selection criteria (Gordon 2012). More data on Y-DNA haplotypes and family
pedigrees will lead to more certain ancestral relationships.
Some of the Gordon clusters in Figure 2 have pedigree assignments (Gordon 2012). Since
Mathematica’s hierarchical clustering algorithm grouped those clusters together just as
they were grouped in the Gordon paper, we conclude that as more pedigrees become
available, each major cluster will be representative of a particular pedigree line. The
TMRCAs of pairs of testees in Figure 2 agree well with those in Figure 4 of the Gordon
paper. This result clearly meets the first six attributes we were seeking.
Figure 2 of the Gordon paper showed three peaks in the histogram of all Gordon surname
pairs. The first peak has a maximum at RCC ~ 8 and a minimum at RCC ~ 24,
corresponding to dates of about 1600 and 900 CE, respectively. These are typical RCC
limits for major surname clusters. The phylogenetic trees clearly show surname clusters
that each have TMRCAs in that same time interval.
Each cluster has its own TMRCA, and the tree shows estimates of the epochs when the
TMRCAs of each pair of clusters lived. These are the intercluster TMRCAs (using the
terminology of the Gordon paper). Pedigrees rarely trace to epochs before 1000 CE and
this is about the earliest time when the presence of surnames can be traced. However, the
inclusion of testees that have RCC connections >20-25 (i.e., those that were not assigned
to clusters) shows when these non-cluster Gordons shared TMRCAs with all other
Earlier than 1000 CE information on pedigrees is nonexistent, but the Y-DNA record in
the phylogenetic tree indicates clearly the approximate dates when these family groups
and clusters joined at a progenitor ancestor. Those dates coincide with the second peak in
Figure 2 of the Gordon paper, which has a maximum at RCC ~ 48, corresponding to
about 2000 years ago (~ 100 BCE). The phylogenetic tree shows the joining of clusters in
the time interval between 950 and 12,500 years ago, leading up to the estimated time
when the representatives of Haplotypes I and R had their most recent joint ancestor.
These results strongly indicate that when all Gordon testees are included in a
phylogenetic tree, the testees that were ungrouped by pedigree or RCC value can still be
presented in an evolutionary sequence, except that their TMRCAs with other testees may
be located further back in time. As more testee results become available, more clusters,
more pedigree information and more insight into the evolution of the Gordon surname will
It is important to realize that, while the surname of these testees is Gordon, the figures
trace the relationships of their Y-DNA, through the mutation process, back in time before
surnames were chosen. Figure 3 suggests that the progenitor Gordon Haplogroups I and
R probably lived about 10-11,000 years ago. Mutations at that time caused these two
haplogroups to split, with descendants of Haplogroup I going to Scandinavia, then to the
UK, and descendants of Haplogroup R going to Western Europe, then to the UK. Many
of these descendants, carrying the mutated markers of the progenitor, began living in
Scottish clans or groups without knowing their Y-DNA heritage. The names of those
clans converged to the Gordon surname less than 1000 years ago, and their descendants
now live in Scotland and all over the world.
The analysis that led to these results for Gordon testees can now be applied with
confidence to the analysis of other surnames, other haplotypes, and the haplotypes of
different haplogroups. The RCC time scale can be applied to all these associations but
mutations will still cause standard deviation errors of the order of 20-40 percent over time
intervals of tens of thousands of years (Howard 2009a).
Mathematica, together with its hierarchical clustering package, has met the first four of
the six desirable goals described in this paper’s Introduction. The costs of the program in
2010 vary upwards of $295 USD. Other programs with phylogenetic clustering
capabilities are free, but may have ease-of-use issues. The use of Mathematica described
here is only a small subset of its program capability. We hope that the description of the
process we used to produce the figures in this paper will enable other researchers and
surname administrators to use Excel-type programs to form the RCC matrix. The use of
an application like Mathematica, in conjunction with the RCC matrix, opens several lines
of analysis that have not been available to surname administrators and to the testing
agencies in the past. The application:
· Matches haplotypes with clusters much more efficiently than can be done by
· Automatically forms a phylogenetic tree showing evolutionary relationships
among all testees, including the TMRCA of individual pairs of testees, the
TMRCAs of pairs of clusters, and members in different haplogroups.
· Automatically applies a calibrated time scale to the phylogenetic tree.
· Indicates the times when haplogroups undergo major mutations that spin off
· It puts them on a uniform time scale whereas previous efforts tended
to estimate haplogroup ages through studies of variance, one by one.
· Encourages comparisons between the genetic chronology of Y-DNA testees
and events in human history.
· Allows a better understanding of surname histories and haplotype evolution.
When this series of papers began, we suggested that the correlation approach should be
used in conjunction with the traditional approach when surname administrators attempt to
group markers by eye. We now have evidence that the correlation approach can produce
better, more uniform results. Moreover, when the correlation approach is used in
conjunction with an application program like the Mathematica clustering package, it
produces not only a time scale, but yields even better results because of its ability to show
a higher degree of resolution into clusters, subclusters and, now, even sub-sub clusters.
The approach appears to be valid and can be applied uniformly over long time intervals
of tens of thousands of years, using an extrapolation derived from over 100 pedigrees.
The time scale is linear, and we expect that it can be used to study the evolution of
haplotypes in the distant past.
FamilyTreeDNA uses a TIP process involving STR marker-specific mutation rates to
estimate the time to the most recent common ancestor of any designated testee pair within
a specific number of generations based on their STR marker strings. To convert this to
time, FTDNA suggests using 25 years per generation. The FTDNA method contains
possible errors in each of the 37 individual mutation rates and additional uncertainties in
the number of years per generation. Moreover, since the results of a pair of testees, even
from very different haplotypes, will have approximate dates connecting their ancestors in
the phylogenetic tree, that information on many pairs is missing from FTDNA’s
In contrast, the correlation approach adopts an average mutation rate over all 37 markers.
Uncertainties in the average mutation rate over all markers will have less influence on the
resulting time estimates than in the TIP process. It enables the calculation of time
relationships between any testee and all other testees in the RCC matrix or on the
phylogenetic tree. The applications program places the testee relative, not to just another
testee, but to all testees in the matrix or tree. This process of showing how a testee relates
in time and evolutionary position to every other testee on the phylogenetic tree is a
We hope that the description of the process we used to produce the figures in this paper
will enable other researchers and surname administrators to use Excel-type programs to
form the RCC matrix and use a program such as Mathematica to perform the same types
of analyses from the matrix or the tree that we describe in this paper.
The current status of this investigation suggests the following areas that might be
1. The correlation approach should be used to study the haplotypes of
different haplogroups. The time scale on a phylogenetic tree of
haplogroups generated by Mathematica should then be compared to the
times assigned to the different haplogroups on the ISOGG phylogenetic
2. Now that the Mathematica hierarchical clustering program has been
shown to generate surname clusters that can be ordered in an evolutionary
sequence and can be identified by pedigree line, a more detailed study of
individual marker changes should give us additional insight as individual
marker differences are found to be associated with differences in cluster
membership and their evolution.
3. Work is needed to investigate:
a. The time relationships between SNPs and STRs (the marker
b. The optimum method to determine the age of a haplogroup using
the RCC approach.
c. The degree of agreement between the phylogeny of the ISOGG
and the Mathematica-generated phylogenetic trees.
We thank Tei A. Gordon, the FTDNA Gordon surname administrator, who emphasized
the need to objectively analyze Y-DNA data and to minimize issues of integrity that
involve individual pedigree-grouping relationships. We thank J.J. (Jim) Logan for
discussions involving differences in phylogeny seen between the ISOGG tree and the
comes from the existence of molecular clocks (i.e., constancy in time of the rate of change of a given
molecule), for which there is considerable empirical evidence. Indeed, at the molecular level, the evolution
of a gene is to a large extent a random walk in sequence space, with a well-defined clock….” We believe
that such a clock is present and can be used to date haplotype differences, particularly over long intervals of
time when mutations tend to average out. We use the average mutations over a relatively large number of
markers (37) and we have calibrated the time scale using different families and over 100 pedigrees. Thus
we believe that our method is relatively well-shielded from criticism. Moreover, since there is no indication
of non-linearity in the time scale, and since the largest RCC (~1300) found in our analyses is close to the
time when Y-Adam left Africa, we believe that our RCC time scale is a reasonable “molecular clock.”
Mathematica-generated tree. We are grateful for discussions about the difficulties that
surname administrators encounter when they use traditional methods to group testee
results into family clusters.
Day, H. E. and Edelsbrunner, Herbert, Efficient Algorithms for Hierarchical Clustering
Methods, Journal of Classification, 1, 1984, pp. 7-24.
Everitt, Brian S., Landau, Sabine, and Leese, Morven, Cluster Analysis, Fourth Ed.,
Wiley, NY, 2001.
Gordon, Tei A. and Howard, William E. III, The Evolution of the Gordon Surname: New
Insight From Y-DNA Correlations and Genealogical Pedigrees, Journal of Genetic
Genealogy, 2012, this issue.
Howard (2009a): Howard, William E. III, The Use of Correlation Techniques for the
Analysis of Pairs of Y-Chromosome DNA Haplotypes, Part I: Rationale, Methodology
and Genealogy Time Scale, Journal of Genetic Genealogy, 5, No. 2, Fall 2009, p. 256.
Howard (2009b): Howard, William E. III, The Use of Correlation Techniques for the
Analysis of Pairs of Y-Chromosome DNA Haplotypes, Part II: Application to Surname
and Other Haplotype Clusters, Journal of Genetic Genealogy, 5, No. 2, Fall 2009, p. 271.
Press, William H., Teukolsky, Saul A., Vetterling, William T., and Flannery, Brian P.,
Numerical Recipes: The Art of Scientific Computing, Third Ed., Cambridge Univ. Press,
Cambridge, 2007; see Section 16.4, Hierarchical Clustering by Phylogenetic Trees, pp.
Rammal, R., Toulouse, G., Virasoro, M.A., Ultrametricity for Physicists, Reviews of
Modern Physics, 58, No. 3, July 1986, pp. 765-788.
Wolfram (2010): Wolfram Research, Inc., Mathematica, Version 8.0, Champaign, IL,
 The ISOGG’s Y-DNA Haplogroup Tree can be found at <http://www.isogg.org/tree/index.html>. It is
being continually updated.
 Note in proof: We now have a code that will produce a phylogenetic tree directly from haplotypes.
 The seven options for the linkage method are: single, complete, average, centroid, median, weighted
average, and Ward. The precise definitions of these options are not provided in the Mathematica
documentation. Clear definitions are given by Day and Edelsbrunner (1984), as well as in the
documentation for the Matlab hierarchical cluster package, available at
The plots with the selection of average, centroid, or weighted average were considered to have time, or
RCC axes closer to what we might have expected.
 In April 2010, the cutoff date of the Gordon paper analysis, 242 Y-DNA results were available in the
Gordon surname project (Note 4). We selected only those results where testees had been tested at 37 or
more markers and we use only the first 37 markers to form the correlation matrix and then the RCC matrix
(Howard 2009a). This process narrowed the analysis to 187 individuals from which we were able to group 119 testees (64%)
into well-defined Gordon clusters and subclusters (viz., clusters within a cluster) in the RCC matrix.
 See the Gordon DNA Project at <www.TheGordonDNAproject.com>
 We compared the RCC values from the matrix in the Gordon paper with measured values of RCC from
the phylogenetic tree and derived this relationship. One might expect that there should be an exact
comparison of these values, but the presence of small measuring errors on the graphs combined with the
difference in the way Mathematica derived the tree led to the small discrepancies noted here. Those
discrepancies are small compared to the effect of unknown mutations, especially at low values of RCC
present in the time interval of genealogy.
 We define a major cluster as an RCC grouping that must contain at least four different testees so that at
least 6 pairs ((4 x 3)/2 = 6) will be available for comparison. This process led to the identification of 10
major Gordon Clusters, A, C, D, and E (in Haplogroup I1), H, K, L, Q and T (in Haplogroup R1b1b2) and
Cluster G (in Haplogroup I2b1). The members of each major Gordon cluster are given in Appendix B of
the Gordon paper.
 Since this paper was submitted, we recognized that the junction points on the tree corresponding to dates
20,000 or more years ago may be underestimated and that the tree did not always indicate the same
sequence of phylogeny that the ISOGG tree reports. Since the dates in this paper refer to more recent times,
the dated positions on the tree out to 20,000 years should be approximately correct. Work to investigate
trees within haplogroups that have TMRCAs located at earlier dates is in progress.
 For a description of the TIP process used, see the FTDNA web site at
 Mutations are used explicitly in the FTDNA TIP process and they are implicit, but still present in the
correlation approach. The TIP process is proprietary, but the probabilities included in their estimates are
probably of the same order as the uncertainties in the correlation method expressed in years.
 On page 871 of Press et al, 2010 it is stated that an ultrametric tree has the property that the time distance
of all testees in a cluster from the MRCA of the members of a cluster is the same for all the cluster testees.
This is clearly the case here where the path length on a phylogenetic tree denotes evolutionary time.
Further, it was proposed in the early 1960s that accepted mutation rates might be close enough to constant
that, at the molecular level, actual evolutionary data might be close to ultrametric, i.e., that there was a
“molecular clock.” In Rammal et. al 1986, it is stated “ An additional superiority of molecular phylogeny.
 A SNP is a Single Nucleotide Polymorphism that is used to determine the phylogeny of an evolutionary
sequence. See http://www.isogg.org/wiki/Single-nucleotide_polymorphism.