Originally Published: Vol 9, Num 1 (Fall 2021)
Reference Number: 91.006
Note: The full article is too long to reproduce here but the Abstract through Introduction are provided below. For the full article please use the View PDF link.
The genetic genealogy community has developed many tools and techniques that use the inter-relationships among autosomal DNA matches to assist in identifying common ancestors. Many of these techniques work best with matches who share larger amounts of DNA and are therefore closer relatives whose genealogical connections are more readily discovered. This review discusses the merits of another technique, network visualization, which can cluster large datasets of matches, including those sharing less than 20 cMs (in this review, down to 7 cMs), and can identify and analyze clusters of In-Common-With matches. Especially when combined with other genealogical information, such as known relationships of certain matches or clusters determined by other methods, these clusters can help focus and prioritize our analysis of matches to find our shared ancestry and thereby extend our genealogical knowledge. In this review the Gephi tool was used as the network visualization platform, but the approach is independent of the specific tool.
This review is not intended for those new to autosomal DNA analysis. This is not because the techniques are difficult to understand, but because there are more commonly suggested starting techniques, such as the Leeds Method or the analysis tools provided by the commercial companies. This review covers an approach that may be more useful after those starting techniques have been exhausted.
For that reason, this review assumes that readers have a basic familiarity with other autosomal DNA match analysis techniques, such as the Leeds Method, and with some fundamentals of autosomal DNA analysis for genealogy, such as how the amount of shared autosomal DNA relates to the genealogical relationship between matches.
When many of us take our first autosomal DNA (atDNA) test, we are hoping to find matches who will help us fill the gaps in our knowledge of our own ancestry. We hope for as large a pool of matches as possible, with the somewhat mixed blessing that we then have to untangle the often difficult questions of how they may all be related to us and to each other.
Very often a key subset of matches will share larger amounts of DNA with us, measured in centimorgans (cMs), and our relationship with those matches will be closer and clearer (say, within 3rd cousins). For this subset, where the genealogical relationships between ourselves and those matches do not become clear simply by comparing our known genealogies, we can often ferret out the relationships using our better-known atDNA analysis techniques, such as the Leeds Method and segment matching, or using the suite of deservedly popular tools that the commercial companies and third parties have developed in support of those techniques.
The success rate of our most common techniques, though, drops off rapidly with more distant relationships and as the shared DNA segments get smaller. While sometimes there is no substitute for doggedly researching and comparing genealogies to find common ancestors, these more distant matches can still be frustrating for genetic genealogists, especially when there are a large number of matches in the “4th cousin and further” category whose genealogical relationships are unclear and who are especially resistant to analysis with our most common techniques.
A lesser-known approach, especially suited to tackling these more distant matches, is analysis using network visualization software to group them into In-Common-With (ICW) clusters: groups of matches who are themselves matches to each other and who may, as a result, all descend from a common ancestor.
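To make the idea of an ICW network concrete, the following minimal Python sketch treats each match as a node and each ICW pair as a link, then collects the groups of matches that are connected to one another. The names and pairs are invented for illustration, and grouping by simple connectivity is only an assumed stand-in here; network visualization tools generally offer more refined clustering options than this.

```python
from collections import defaultdict

# Hypothetical ICW pairs: each tuple means the two matches also match each other.
icw_pairs = [
    ("Alice", "Bob"),
    ("Bob", "Carol"),
    ("Dave", "Erin"),
]


def icw_groups(pairs):
    """Group matches into connected components of the ICW network."""
    # Build an undirected adjacency map from the pair list.
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    groups = []
    # Visit nodes in sorted order so the result is deterministic.
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from `start`.
        stack = [start]
        group = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(neighbours[node] - seen)
        groups.append(sorted(group))
    return groups


groups = icw_groups(icw_pairs)
print(groups)
```

In this toy data Alice and Carol do not match each other directly, yet they fall into the same group through Bob. That is one reason a group found this way is only a hint that its members may share an ancestor, not proof of it.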
This is not a new approach: network visualization has been used for ICW cluster analysis at least as far back as 2017 by Barbara Griffiths and Shelley Crawford, using a variety of tools including Pajek and NodeXL. In this review we have used the Gephi tool (free and open-source at https://gephi.org/) as the clustering and graphing platform. Apart from differences in the flexibility of clustering and filtering options, the approach is independent of the tool, and any similar network visualization software package would support the same approach.
Others are also applying network visualization to their own autosomal data analysis using similar but not identical approaches to the examples shown in this review. Their results are often displayed on social media forums to other genetic genealogists and generate much surprise and discussion, which suggests that the techniques are not widely practiced and might benefit more people if they were more widely understood.
Network visualization is not a replacement for more common analysis techniques. In fact, as a clustering approach for ICW matches it is very similar to a Leeds Method analysis, though the necessary data are more complicated to set up and the results more complicated to analyze. This is one reason that a simpler method like Leeds should be attempted first. Another reason, as we will explain in this review, is that when an initial subgroup of matches has first been mapped to their genealogical relationships using other methods, network visualization can extend that mapping to wider clusters of more distant matches. This means that network visualization is not only a stand-alone technique but also an approach that can extend the results of prior analysis.
Another advantage of network visualization is that it can be used to sort large networks into clusters very quickly, to explore these clusters and investigate their shared origins, and to analyze the network as a whole by applying different filters that highlight important relationships both within and across clusters.
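As a rough illustration of what such a filter does, the Python sketch below keeps only the ICW links whose two endpoints both share at least a chosen number of cMs with the tester, then writes the surviving links as a simple edge list. All names and cM values are invented, the 7 cM threshold simply echoes the lower bound used in this review, and the `Source`/`Target` column headers are assumed to follow the edge-table convention of Gephi's spreadsheet import.

```python
import csv
import io

# Hypothetical data: cMs shared between the tester and each match.
shared_cm = {"Alice": 212.0, "Bob": 45.5, "Carol": 9.1, "Dave": 6.2}

# Hypothetical ICW pairs among those matches.
icw_pairs = [("Alice", "Bob"), ("Bob", "Carol"), ("Carol", "Dave")]


def filter_edges(pairs, cm_by_match, min_cm):
    """Keep edges whose two endpoints both share at least min_cm with the tester."""
    return [
        (a, b)
        for a, b in pairs
        if cm_by_match.get(a, 0.0) >= min_cm and cm_by_match.get(b, 0.0) >= min_cm
    ]


def edges_to_csv(pairs):
    """Render an edge list as CSV text with Source/Target headers."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    writer.writerows(pairs)
    return buffer.getvalue()


kept = filter_edges(icw_pairs, shared_cm, min_cm=7.0)
print(edges_to_csv(kept))
```

Raising or lowering `min_cm` and re-running the sketch shows how a threshold filter changes which links, and therefore which clusters, remain visible.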
While we present a few examples of network visualization analysis of ICW clusters in this review, we also propose that network visualization be considered a more general approach, which may include several analysis techniques depending on the desired outcome and the number and type of autosomal matches. Our main point in writing this review is simply to show by example the merits of network visualization as an important tool in an analysis toolkit, not to suggest that there is one best network visualization analysis approach.
In this review we show by example how a network visualization tool like Gephi can be used to sort a large ICW network into clusters which include matches at smaller levels of shared cMs. We offer one approach for extending the relationship knowledge of a small number of matches out to the rest of the identified clusters, and also show how the tool’s filtering options can be used to dissect the network in different ways to gain additional insights.
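The idea of extending known relationships out to the rest of a cluster can be sketched in a few lines of Python. This is not the procedure described in the full article, only an assumed simplification: each unlabelled match is given the most common known ancestral line among the labelled matches in its own cluster, as a suggestion to investigate. The cluster numbers, names, and lines are all hypothetical.

```python
from collections import Counter

# Hypothetical cluster assignment, e.g. as exported from a network tool.
cluster_of = {
    "Alice": 0, "Bob": 0, "Carol": 0,
    "Dave": 1, "Erin": 1,
    "Frank": 2,
}

# Hypothetical ancestral lines already established for a few matches.
known_line = {"Alice": "paternal grandmother", "Dave": "maternal grandfather"}


def suggest_lines(cluster_of, known_line):
    """Suggest a line for each unlabelled match from the labels in its cluster."""
    # Count the known labels that occur in each cluster.
    label_counts = {}
    for match, line in known_line.items():
        cluster = cluster_of.get(match)
        if cluster is None:
            continue  # labelled match is not part of the network
        label_counts.setdefault(cluster, Counter())[line] += 1

    # Give every unlabelled match its cluster's most common label, if any.
    suggestions = {}
    for match, cluster in cluster_of.items():
        if match in known_line:
            continue
        counts = label_counts.get(cluster)
        suggestions[match] = counts.most_common(1)[0][0] if counts else None
    return suggestions


suggestions = suggest_lines(cluster_of, known_line)
print(suggestions)
```

A cluster with no labelled members, like Frank's here, yields `None`, which marks it as a candidate for further genealogical research, not an answer.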