Geographic
Patterns of Haplogroup R1b in the
Kevin D. Campbell
Abstract
The recent availability of Y-
Address for correspondence: [email protected]
Received:
Introduction
While
The lack of reliable data can be
attributed to a number of causes. These
include inconsistent use of markers and nomenclature, the cost of
extensive marker panels, and a number of other issues that are familiar to
most academic and amateur genetic researchers.
However, it is suggested that there are two root causes that have
significantly hindered population analysis – (1) the lack of uniformly
collected, independently verified data sets and (2) the tendency of researchers
to shield and obfuscate[1]
their analysis.
The major factor contributing to
the first is the nature of the submission process at the largest databases,
which results in non-validated and unsubstantiated data. The sad truth is that many of this field’s
largest databases rely on data and geographic inputs provided by enthusiastic
but uninformed individuals. While
transcription and upload errors for YSearch and SMGF may be very low, many
errors creep in through marker translation and geographic speculation. Though some might dismiss marker translation
errors as insignificant, errors in geographic data are not. In fact, the essence of population analysis is
the deductions and inferences that can be made about how the Y-STR data relate
to specific geographic areas or population migrations.
As an example, in the
The aforementioned statistics call
into question the reliability of information provided by individual
participants and suggest that organized and structured data-gathering
studies might provide better sources of data.
However, some respected researchers have published popular texts with
new theories while providing insufficient information or omitting critical
linkages that might facilitate a formal and critical review.
For example, though Capelli’s
study of
Though Bryan Sykes has made
available the data that he used in his recent popular book “Blood of the Isles”, he
also leaves out critical linkages between the particular haplotypes and the
conclusions that he draws in the text.
This is particularly frustrating because, while the haplogroups in his
study are reasonably distinguishable, the assertions and theories that he makes
about specific subclades are not easily reviewed in the context of the
supporting data. The present article
will fill in some of this missing information.
Similarly, most people would
consider the recent book The Origins of the British, by Stephen
Oppenheimer (2006), to be an authoritative work in this rapidly evolving
field. While a substantial portion of
Oppenheimer’s Y-STR study also uses the Capelli data, he assigns new haplotype
labels to the data (“16 distinct types
of R1b”) without ever providing the detail necessary to link his
haplotypes to the underlying data. As is
the case with Sykes’s work, this makes any real academic or juried review of
his conclusions impossible and lessens the usefulness of his work to other
researchers. In some cases, like the
aforementioned Pict test, researchers have been quick to partner with testing
companies to make money from their private theories.[4]
The purpose of this paper is to
examine the underlying Sykes R1b data and see if it can be linked to his
founder haplotypes and the conclusions of his analysis. The goal is to
provide additional insight into the work of these researchers and make it more
useful to the individual genetic genealogists who look to their data as a link
to the past.
Methods
This study focuses on Haplogroup
R1b, which comprises the vast majority of the
Results
Step 1 - Coding of the OGAP Data
For data collection, Oxford
Genetic Atlas Project (OGAP) data was downloaded from Bryan Sykes’s web site and
converted from PDF to Microsoft Excel format.
2,322 samples were then coded by haplogroup using
Sykes included the following
description of the data in the supplementary data file:
Y-chromosome DNA
(yDNA) - Samples collected early in
OGAP were amplified across the following seven markers: DYS 19, DYS 389i,
DYS389ii, DYS 390, DYS 391, DYS 392, DYS 393 using conditions described by de
Knijff et al (International Journal of Legal Medicine, 110: 134-140, 1997).
Later samples were typed for these and three additional markers: DYS 388, DYS
425 and DYS 426, using the two-stage multiplex conditions described by Thomas
et al. (Human Genetics, 105: 577-581, 1999). Alleles are reported as the number
of repeat units. For reasons of continuity within OGAP, DYS 389i is reported as
three repeats lower than the allele size produced by the ABI 3100. DYS 389ii-i
reports the difference between 389i and 389ii, the reason being that the repeat
size at 389ii is not independent of 389i whereas the difference between them
is. Although Y-chromosomes were assigned to clades, largely by RFLP [Author
- restriction fragment length polymorphism] analysis, these
assignments are not reported here as they do not necessarily correspond to the
SNP-based system recommended by the Chromosome Consortium (Genome Research
12:339-348, 2002).
Geographical
distribution - Genetic data are assigned to
geographical regions based on the birthplace/residence of the paternal
grandfather. This was done to minimize the effect of very recent migration. The
regional boundaries are shown on a map which precedes the Prologue in Blood of
the Isles. These data are copyrighted
and must not be reproduced without permission. Other formats and additional
details may be available for academic collaborations.
Several things are worth noting
about the data. First, the Sykes data
only uses 10 markers (DYS19, DYS389i, DYS389ii, DYS390, DYS391, DYS392, DYS393,
DYS388, DYS425 and DYS426). In addition,
only approximately 64% of the data are complete, with 36% of the data missing
four markers: DYS439, DYS388, DYS425 and DYS426. While the missing markers appear to be a
serious shortcoming, 94% of all the DYS425 and DYS439 markers and 73% and 74%
of the DYS426 and DYS388 markers in Sykes’ full data set, respectively, have a
value of 12.[5] These markers therefore do not have
sufficient spread and variability and are, in general, of limited use in
discriminating between haplotype patterns within this set of data.
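The percentages quoted for these markers follow directly from the allele counts given in footnote 5. A minimal Python sketch of that arithmetic is shown here; it uses only the counts reported in the footnote, and the function and variable names are illustrative rather than part of the original spreadsheet.

```python
# Number of samples with an allele value of 12, out of 1496 reported values,
# as listed in footnote 5 of this article.
REPORTED = 1496
ALLELE_12_COUNTS = {"DYS425": 1412, "DYS439": 1411, "DYS426": 1088, "DYS388": 1108}


def share_of_twelve(counts, total):
    """Return the fraction of reported values equal to 12 for each marker."""
    return {marker: n / total for marker, n in counts.items()}


shares = share_of_twelve(ALLELE_12_COUNTS, REPORTED)
for marker, share in shares.items():
    # Rounded to whole percentages: 94, 94, 73 and 74.
    print(f"{marker}: {share:.0%}")
```

A marker whose values are this concentrated on a single allele adds little power to separate one haplotype from another.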
Another interesting fact is the
haplogroup distribution of the data. Due
to the lack of haplogroup designations in the original data, Athey’s haplogroup
calculator was used as a proxy to classify each haplotype. Several haplotypes were removed because
of missing data. Table 1 shows a comparison of the results of the
haplogroup calculator with Sykes’ published “Clans.”
In this table, percentages shown
in the middle column reflect the output of Athey’s calculator while the
percentages in the far right column correspond to the breakdown published by
Sykes in Appendix C of his book.
It is clear from this comparison
that Athey’s calculator classifies the data in proportions similar to
Sykes’ Clans, and thus one may infer the underlying meaning of Sykes’ clan
nomenclature.
In addition, it is
important to note that Sykes’ Clan categorizations were not based primarily on
single nucleotide polymorphism (SNP) testing.
As stated in the italicized quote above, “Y-chromosomes were assigned to clades, largely by RFLP analysis, these
assignments are not reported here as they do not necessarily correspond to the
SNP-based system recommended by the Chromosome Consortium.”
Finally, one should understand the
regional borders that Sykes uses in his study.
Since the purpose of this analysis is to draw geographic inferences, we
are limited in our insight by the definition of the geographic areas from which
the data is collected. Figure 1 shows
the regional borders that are coded in the OGAP data.
Table 1. Calculated Haplogroups vs. Sykes Clans
Step 2 – Extracting R1b (Oisin) Data
The 1625 haplotypes identified as
R1b in the previous step were extracted from the data set. These included all of the R1b haplotypes
shown in Sykes Table 1, plus those for
Figure 1. Regional Borders Used in the OGAP Analysis to Classify Individuals
The first step in
understanding the data was to identify modal haplotypes for each region as
a descriptive view of the R1b data set.
However, examination of the modal haplotypes for the individual regions
was not informative because all regions and the full data set matched the
standard Atlantic Modal Haplotype.
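To illustrate why regional modal haplotypes can fail to separate regions, the following is a minimal Python sketch. It assumes the usual marker-by-marker definition of a modal haplotype (the most common allele at each marker), which the text does not spell out. The region names and three-marker samples are invented for illustration and are not OGAP data.

```python
from collections import Counter


def modal_haplotype(haplotypes):
    """Return the most common allele at each marker across the haplotypes.

    Ties are resolved in favor of the allele seen first.
    """
    if not haplotypes:
        raise ValueError("need at least one haplotype")
    return tuple(
        Counter(column).most_common(1)[0][0] for column in zip(*haplotypes)
    )


# Invented three-marker samples for two hypothetical regions.
samples = {
    "Region A": [(14, 24, 11), (14, 24, 10), (14, 23, 11)],
    "Region B": [(14, 24, 11), (15, 24, 11), (14, 24, 11)],
}
modals = {region: modal_haplotype(rows) for region, rows in samples.items()}
print(modals)
```

Although the two invented regions contain different minority haplotypes, both produce the same modal value, which is the kind of averaging effect described above.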
A view of the R1b data in the form
of a connected graph – as shown in Figure 2 – shows a high degree of
“cubism.” By this I mean a high degree
of nodal interconnectivity among the data points that results in opposite
vertices “washing out” differences in the data.
Clearly, data analysis based upon
unique combinations of markers (i.e. haplotypes) instead of individual markers
would be necessary.
Figure 2. Network Analysis of the Top Twenty R1b Haplotypes
Step 3 – Analysis of Haplotypes
Since descriptive statistics
tended to “average out” differences in the data, other methods were needed to
identify patterns and analyze the data.
To do this the most common haplotypes were identified and two methods of
analysis were performed. Appendix A
shows the haplotypes in the OGAP data.
The OGAP haplotypes roughly follow the frequency distribution of YSearch
and of McEwan’s (2007) groups if one takes into account that the OGAP data is
light on Irish samples in comparison with these other sources.[6]
The OGAP designations in this
table were assigned sequentially in decreasing frequency of occurrence. The OGAP numbers from this table will be
referenced throughout the remainder of this report.
Two types of analysis were
conducted to identify patterns – affinity analysis and network analysis.
For affinity analysis of the
haplotypes, an Excel spreadsheet was developed to look for patterns and
anomalies in the data. An algorithm was
created that took as an input the 10 marker values for a haplotype or signature
to be reviewed and then compared that haplotype to the R1b subset of the
database. The algorithm calculated the
genetic distance and reported back the number of perfect (i.e., zero distance)
matches by OGAP region. To account for
the differing level of sampling in each region (e.g., small for
Table 2. Example of Identifying OGAP8 Affinities
For example, haplotype OGAP8, which
is generally considered the quintessential Irish haplotype, has 34 perfect
matches in the database. Because
This analysis was repeated for the
top 20 haplotypes in the OGAP data.
These haplotypes, which cover 60% of all OGAP samples, appear sufficient
to identify major regional affinities. Analysis of additional haplotypes would
be increasingly subject to sampling error.[8]
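The affinity calculation described above was carried out in an Excel spreadsheet. The following Python sketch is only an approximation of that procedure under stated assumptions. Genetic distance is assumed to be the sum of absolute repeat differences across markers, and regional sampling is accounted for by dividing matches by the regional sample size, the alternative formula mentioned in footnote 7. The records and region names are invented.

```python
from collections import Counter


def genetic_distance(a, b):
    """Sum of absolute repeat-count differences (an assumed stepwise metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def perfect_matches_by_region(query, records):
    """Count zero-distance matches to `query` in each region.

    `records` is an iterable of (region, haplotype) pairs.
    """
    hits = Counter()
    for region, haplotype in records:
        if genetic_distance(query, haplotype) == 0:
            hits[region] += 1
    return hits


def match_rate_by_region(query, records):
    """Perfect matches divided by the number of samples in each region."""
    records = list(records)
    sizes = Counter(region for region, _ in records)
    hits = perfect_matches_by_region(query, records)
    return {region: hits[region] / sizes[region] for region in sizes}


# Invented records with three markers; the OGAP haplotypes use more markers.
records = [
    ("Region A", (14, 24, 11)), ("Region A", (14, 25, 11)),
    ("Region A", (14, 25, 11)), ("Region A", (14, 24, 10)),
    ("Region B", (14, 24, 11)), ("Region B", (14, 24, 11)),
    ("Region B", (14, 25, 11)),
]
rates = match_rate_by_region((14, 25, 11), records)
print(rates)
```

Dividing by regional sample size keeps a heavily sampled region from appearing to have a strong affinity simply because it contributed more samples.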
The results of the analysis of the
top 20 haplotypes are shown in Table 3.
It should be noted that in this table negative values were removed to
reduce clutter, and significant geographic anomalies were color coded to aid
in identifying tendencies. Finally,
especially interesting results were boxed to help in later discussion in this
paper.
License was also taken in the
reordering of rows and columns of the table in an attempt to group similar
haplotypes and nearby regions. While such
analysis is called “affinity analysis” and can be conducted mathematically,
this analysis was done manually to better allow for subjective considerations
of the data.
The second type of analysis that
was performed was network analysis. This
analysis, which is common in the genetic sciences, was conducted using the Fluxus
Networking program version 4.2.0.0. Figure
3 shows the results of the network analysis for the top twenty OGAP
haplotypes.
It should be noted in Figure 3
that nodes have been relocated and line length changed for increased
readability. Nodes have also been
colored to reflect the regional affinities identified in Table 3.
Conclusions
Analysis of the Oxford Genetic
Atlas Project data has yielded interesting results. The geographic affinity
results shown in Table 3 and the network analysis results shown in Figure 3 are
synthesized in Figure 4. In this
graphic, key haplotypes with strong regional affinities were placed in their
rough geographic position. No attempt
was made to force every haplotype somewhere on the map, as it is obvious that at
this level of analysis some haplotypes are pervasive and ubiquitous and not
easily generalized to a single geographic region.
Table 3. Haplotype Affinity by OGAP Region
Once located, haplotypes that
differed by a single mutation were connected with lines. Figure 4 reflects the general
interconnectivity resulting from the network analysis of Figure 3. The lines in Figure 4 should be
thought of as one possible path of migration, not necessarily the only
path. The interconnections shown in Figure
4 are not based on any individual mutation rates; they
are based upon the occurrence of mutations, the principle of parsimony, and the
general south-to-north flow of R1b discussed by Sykes and Oppenheimer. Parsimony, in this case, reflects the
generally acknowledged flow from higher concentrations of haplotypes to lower,
more diffuse concentrations.
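The rule of connecting haplotypes that differ by a single mutation can be sketched in Python as follows. This sketch assumes that a single mutation means exactly one marker differing by exactly one repeat. It is not the algorithm used by the network program, and the labels and three-marker values are hypothetical.

```python
from itertools import combinations


def is_one_step(a, b):
    """True if exactly one marker differs, and by a single repeat."""
    diffs = [abs(x - y) for x, y in zip(a, b) if x != y]
    return diffs == [1]


def one_step_links(named_haplotypes):
    """Return sorted name pairs whose haplotypes are one step apart."""
    items = sorted(named_haplotypes.items())
    return sorted(
        (n1, n2)
        for (n1, h1), (n2, h2) in combinations(items, 2)
        if is_one_step(h1, h2)
    )


# Hypothetical labels and three-marker values, not taken from the OGAP data.
haplotypes = {
    "H1": (14, 24, 11),
    "H2": (14, 25, 11),
    "H3": (14, 25, 10),
    "H4": (15, 23, 11),
}
links = one_step_links(haplotypes)
print(links)
```

In this invented example H1 links to H2 and H2 links to H3, while H4 remains unconnected because it differs from every other haplotype at two markers.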
Figure 4. Geographic Patterns of R1b in the
Some of the observations and
conclusions of this analysis are as follows.
1. The methodology clearly identified and quantified
what has previously been called the Irish subclade. Whether called the Irish Modal Haplotype or
the “Ui Neill haplotype” as in the
2. Similarly OGAP10 which shares the
3. Interestingly, OGAP5 is a very
prevalent haplotype that also shows up predominately in
4. OGAP19 is interesting in that it
shows an extreme correlation with both
5. OGAP4 is particularly
intriguing. It is ubiquitous across all
areas of
6. OGAP6 is prominent in Argyll and
the
7. OGAP9 and OGAP11 both show an
affinity for both the Northern Isles and the Borders regions. This affinity is distinctive, but the author
is unqualified to venture a theory that might explain this geographic
discontinuity.
8. OGAP13 and OGAP17 both show a
clear affinity for
9. OGAP14, OGAP16, and OGAP20 all
show a common regional affinity. Though
their presence in Tayside is very slightly higher than in
10. OGAP7 seems to be most prevalent
in
11. The core haplotypes for the full
12.
13. Like
Summary
Through the analysis of Sykes’
OGAP data, this study has provided a means of linking DNA results to haplotypes
and conclusions in Sykes’ book, “Blood of the Isles.” The study has confirmed Sykes’ interpretation
of the data and, hopefully, provided a means for other researchers to further
validate and extend his work. The study
confirmed some subclades identified by Sykes and identified some
new subclades worthy of further research.
Key subclades that the study posits and which are defined by Sykes
include those of the Picts and the Dal Riada Celts.
Picts –
It is asserted that OGAP4 best represents the Pictish ancestry of
Dal Riada Celts – When considered in a narrow genetic sense, the Gaels of Ireland, as
identified by the
Several interesting clusters were
identified that show geographic affinities but also discontinuities. Scottish clusters OGAP9 and OGAP11 have a strong
presence in the Borders as well as the Northern Isles. English clusters OGAP14, OGAP16, and OGAP20
show a predisposition to both
Electronic Database Information
Capelli, C. et al. 2003 Data Set: http://freepages.genealogy.rootsweb.com/~gallgaedhil/Capelli.htm
http://www.bloodoftheisles.net/results.html
John McEwan’s R1b Haplotypes: http://www.geocities.com/mcewanjc/p3modal.htm
Whit Athey’s Haplogroup Predictor:
References
McEwan J (2007) Phase 3 analysis: Ysearch 37 STR modal
summary and analysis tables (web site).
Oppenheimer S (2006) The Origins of the British - A Genetic
Detective Story, Constable and Robinson,
Sykes B (2006) Blood of the Isles: Exploring The Genetic
Roots of Our Tribal History. Bantam
Books. Published in the
Appendix A – OGAP Haplotypes
The 1625 data points that comprise
the R1b data set include 291 separate haplotypes. However, 50% of the data can be accounted for
with only 10 haplotypes; 60% by 20 haplotypes; and 68% by 30 haplotypes. In addition, haplotypes beyond the top 30
have only single-digit frequencies compared to the most frequent – the Atlantic
Modal Haplotype (AMH), which occurs 262 times in the data.
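The coverage figures above are cumulative shares of the ranked haplotype frequencies. The following minimal Python sketch shows the calculation. The list of counts is invented, since the full frequency table is not reproduced here; only the AMH count of 262 out of 1625 samples is taken from the text.

```python
def coverage_of_top(counts, k):
    """Fraction of all samples covered by the k most frequent haplotypes."""
    ranked = sorted(counts, reverse=True)
    return sum(ranked[:k]) / sum(ranked)


# Invented frequency counts for six hypothetical haplotypes.
counts = [40, 5, 25, 10, 15, 5]
print(coverage_of_top(counts, 1))
print(coverage_of_top(counts, 3))

# Share of the R1b data set accounted for by the AMH alone,
# using the figures stated in the text.
amh_share = 262 / 1625
print(f"{amh_share:.1%}")
```

Applied to the real table, the same function would give the 50%, 60% and 68% figures for k = 10, 20 and 30.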
Below are the top 50 haplotypes in
order of descending frequency. These are
numbered OGAP1 through OGAP55 for reference in this study and represent all
haplotypes that occur more than 5 times.
The distribution of these haplotypes in Sykes’ OGAP data and in YSearch
(www.ysearch.org, as of December 2006) is
shown on the right hand side of the chart.
Also, a mapping of John McEwan’s
(2007) R1b subclades is shown on the left hand side of the chart. The letters designating the McEwan group
refer to the groups described in Appendix B that were assigned when McEwan’s
individual haplotypes were grouped together to reflect the much smaller number
of markers in the OGAP data.
Appendix B – McEwan’s R1b Haplotypes Reduced
By mining YSearch and collecting
similar 37-marker haplotypes into clusters, McEwan (2007) has identified a
large number of R1b “types” that comprise the world-wide scope of this
data. It is interesting to consider how
the Sykes OGAP data relates to McEwan’s haplotypes. However, to compare the data, McEwan’s modal
haplotypes had to be collapsed into smaller groups to reflect the smaller
number of markers in the OGAP haplotypes (refer to Appendix A).
The following is the reduction of
the McEwan haplotypes to 10-marker haplotypes.
In each case, letter designations have been included here for the
purpose of mapping the full set of McEwan haplotypes (designated R1bSTR##) to
the reduced set used in this analysis (Letter Groups). While this exercise
necessitates the loss of considerable resolution in the McEwan haplotypes, the
exercise is included here to provide traceability to the analysis included in
Appendix A.
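The reduction described in this appendix amounts to dropping the markers that OGAP does not report and then grouping the modal haplotypes that become identical. The following minimal Python sketch uses invented type names and allele values; the marker subset shown is shortened for illustration and is not the full OGAP panel.

```python
from collections import defaultdict


def reduce_haplotype(haplotype, keep):
    """Keep only the markers named in `keep`, in that order."""
    return tuple(haplotype[marker] for marker in keep)


def group_by_reduced(modals, keep):
    """Group haplotype names that become identical after reduction."""
    groups = defaultdict(list)
    for name, haplotype in modals.items():
        groups[reduce_haplotype(haplotype, keep)].append(name)
    return dict(groups)


# Invented modal haplotypes; "T1" to "T3" are hypothetical type names.
modals = {
    "T1": {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS458": 17},
    "T2": {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS458": 18},
    "T3": {"DYS19": 14, "DYS390": 25, "DYS391": 11, "DYS458": 17},
}
keep = ("DYS19", "DYS390", "DYS391")
groups = group_by_reduced(modals, keep)
print(groups)
```

Here T1 and T2 differ only at a marker outside the retained subset, so they fall into the same reduced group, which illustrates the loss of resolution noted above.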
[1] In the case of academic researchers,
many keep tight control of their data.
It is hoped that in the future scientific journal editors will
require the submission of supporting data, which they might even hold for a
period of time after publication of analytic articles. Such submissions would ensure that important
research is fully documented even years later, when the article is no
longer at the forefront of everyone’s mind.
[2] For the Campbell Project, the
distribution of the birth year of the oldest proven ancestor of the
participants is as follows: 4% earlier
than the 1600s, 7% in the 1600s, 56% in the 1700s, and 33% in the 1800s. There is no reason to expect this to be
anything but representative, and in fact, one could be convinced that some of
these participants are outliers with longer than usual paper genealogies.
[5] Counts are for those markers that have been reported in the
full data set. Alleles of 12
include: DYS425 (1412/1496), DYS439
(1411/1496), DYS426 (1088/1496), DYS388 (1108/1496).
[7] The results column of Table 2 has
the same relative weighting as if samples observed were divided by sample size
(e.g. Ireland = 2/17) but this is just an alternative formula.
[8] OGAP haplotypes below OGAP30 have
single-digit sample sizes.
[10] http://www.m222.net/R1b1c7
[12] Mark McDonald writes, “The Dal Riada (Dalriads) leadership who came
from Ireland in circa 500AD into what is now Argyll spoke a language akin to
what is now called Erse (Irish Gaelic to the Scots) and introduced that
language into Scotland - the root of modern Scots Gaelic. They were called 'Scoti' by the Romans it is
said - a word for 'raider' used in those days, and it is the root of the name
[14] When writing about Argyll, Sykes
writes, “However, the genetic signal, as
far as I can judge, points to a substantial, and by the look of it, hostile
replacement of Pictish males by Dalriadian Celts, most of whom relied on
Pictish rather than Irish women to propagate their genes.” Sykes, Blood of the Isles, page 210.
[15]
Other researchers suggest that this haplotype might be Dal Riadic Celt
(see:
http://searches2.rootsweb.com/cgi-bin/igetch2?/u1/textindices/G/GENEALOGY-DNA+2004+14773453734+F ).
However, the ubiquitous presence of
OGAP4 across
[16] Again this analysis confirms the
statement on page 239 of Sykes book, “The
Atlantis Chromosome, the prevalent Y chromosome in the Clan, is very frequent
in
[17] Sykes, Blood of the Isles, page
282. It should be noted that Oppenheimer
writes very little about the Picts in his book, Origins of the British. The reason may be that Oppenheimer’s analysis
based on the Capelli data relies only on 6 markers instead of 10. The lack of markers DYS439, DYS389-1, and
DYS389-2 causes the Pict haplotypes (OGAP4) to be grouped with, and masked by,
OGAP2 and OGAP12 in the Capelli data.
[18] Sykes, Blood of the Isles, page
214 – “So far, we have four possible
influences on the genetic structure of the people of Scotland, first the Picts,
then the Gaels of Ireland, synonymous with the Celts, the Vikings, and in the
south of Scotland particularly, the Anglo-Normans.”