Monday, January 23, 2012

Eurogenes' North Euro clusters - phase 2, final results

This is a continuation of my ChromoPainter analysis of Europeans from north of the Pyrenees, Alps and Balkans (see here). To obtain the most accurate results possible on my laptop, I increased the burn-ins and iterations in fineSTRUCTURE to 500K each (a 5-hour run in all, which is all I'm willing to put this machine through). The end product looks very similar to my initial analysis, in which I explored the data at 200K burn-ins and iterations. What I think this shows is that the results are robust, and I doubt they'd change much even after a couple of days of running fineSTRUCTURE.

Indeed, as mentioned in my previous blog entry, this appears to be the most detailed and accurate cluster analysis of this part of Europe produced anywhere to date. There are 21 clusters in all, with at least 20 looking like strong signals of genetic substructures across North, West, Central and East Europe (see spreadsheet for individual classifications). They include:

pop0 - West Finnish1: This is a pair of reference individuals, most likely from Western Finland, judging by their PCA and ADMIXTURE results. They are either from the same community, or have a very similar mix of very specific ancestries.

pop1 - Erzya + Moksha: This includes all of the Erzya and Moksha in the project, plus a Russian with recent Erzya ancestry. It's closely related to ethnic Russian clusters that stretch from Northwest Russia to near the Volga, and also to the Estonian cluster.

pop2 - South/Central Finnish: This is the largest Finnish cluster, and that's probably more than just the result of sampling bias. I would say that the greater part of the Finnish population would belong to this type of cluster, which occupies regions of highest population density within the country.

pop3 - Fenno-Scandian: This cluster includes a Northern Swede, a Swede with probable recent Finnish ancestry, and Finns with probable recent Swedish influence. I have a feeling that Finland Swedes and Aland Islanders would also be placed here more often than not.

pop4 - Northwest Russian/Southeast Finnish: Although this cluster includes only two individuals, it's definitely much more than just the result of two relatively closely related samples being in the same run. I'd hazard a guess that Northwest Russians with, say, significant Ingrian ancestry, would land here, and so would Finns with recent Russian ancestry.

pop5 - West Finnish2: Based on PCA and ADMIXTURE results, most of these Finns likely come from Western Finland, probably from places like Southern Ostrobothnia. They possibly also have some Swedish influence.

pop6 - West German: This cluster is based on individuals from Western and Northwestern Germany. It also includes a Dutchman, Austrian and people of mixed origin, like a Dane with French and German ancestry, and Americans with British, German, Scandinavian and/or Polish ancestry. In other words, this is where Northwestern Europe meets Central Europe.

pop7 - Vologda Russian: Most of the Vologda Russians from the HGDP land here, so this appears to be a local cluster. Judging from its phylogeny, it looks like a mix of North Slavic, Baltic and Finnic influences.

pop8 - East Finnish: All the project and reference Finns with substantial ancestry from new settlement areas of Eastern Finland appear in this cluster. No wonder then, that this is the cluster with the highest chunk count in this analysis.

pop9 - Estonian: This is a mixed cluster, including individuals from Estonia, and, as far as I know, Russians with substantial ancestry from near Estonia. As mentioned above, it's closely related to the Erzya + Moksha, Northwest Russian and Vologda clusters. However, it's clearly much more western than any of these clusters (for instance, see the PCA below), which suggests Germanic influence in its makeup.

pop10 - Cornish: Almost all of my Cornish samples from the 1000 Genomes Project feature in this very local cluster, which shows the highest chunk count among the Western European samples. The overall results suggest a lack of outbreeding in recent times.

pop11 - French/Belgian: Interestingly, this cluster includes the bulk of the French samples, a French Canadian, and two Belgians. On the other hand, the most northerly French are placed in the more cosmopolitan Northwest European cluster (see below).

pop12 - Lithuanian: All of the more or less pure Lithuanians fall in this cluster. The two that don't are a reference sample from Behar et al. 2009, who always appears very Belorussian-like in other analyses and here sits in the East Slavic cluster, and a project member with recent German ancestry (LIT3). The Western European influence carried by the latter pushes him into the Polish/West Ukrainian cluster, despite him not having any documented Polish or Ukrainian ancestry.

pop13 - Northwest Russian: This cluster appears to be made up of Russians who have more Finnic, and/or perhaps Eastern Baltic, ancestry than the individuals in the East Slavic cluster. In other words, it's more northerly, less westerly, and more closely related to the Finnic-speaking Erzya, Moksha and Estonians.

pop14 - Irish + West British: Most Irish individuals fall in this cluster, as well as British samples from Western Scotland and Wales. It's tempting to correlate this cluster with Celtic genetic ancestry in the Isles.

pop15 - South/West Scandinavian: This is basically a Norwegian and Southern Swedish cluster. It also features Swedes from other parts of the country who most likely have some German, Walloon and/or French influence.

pop16 - East German: This cluster includes individuals with significant or even overwhelming Germanic ancestry, but also with very clear Western Slavic input. One of the individuals here is of mixed Polish, German and Swedish ancestry, which pretty much sums up the character of this cluster in a modern context. The presence of two Hungarians from Behar et al. 2009 isn't surprising, because Hungary was settled by both Germanic and Western Slavic groups from the early Middle Ages until modern times.

pop17 - Northwest European: I had reasonable hopes of breaking up this large cluster into a couple of units at least. However, that did not happen, and I don't think it will unless I obtain more samples from the relevant areas of Europe, like Holland and specific parts of the UK. I think the main reason this cluster failed to budge is its cosmopolitan nature. In other words, the samples here include some of the most outbred in the analysis, and this, coupled with the fact that they carry very similar ancestral components, means that fineSTRUCTURE doesn't have anything to latch onto to create divisions.

pop18 - East Scandinavian: This could also be called a Swedish cluster. It's almost entirely made up of Swedes, usually from Eastern or Southeastern Sweden, and/or occasionally with recent Finnish influence.

pop19 - Polish/West Ukrainian: The vast majority of the Poles fall in this cluster, and about half of the Ukrainians from Yunusbayev et al. 2011. Most of these Ukrainians appear to be from the Lviv district in the west, and some might even have fairly recent Polish and/or German ancestry. In fact, I would say the latter is a good bet for UkrLv240Y, who shows large Western European segments on several chromosomes.

pop20 - East Slavic: All of the Belorussians cluster here, and so do Russians from near Belorussia and Ukraine, and almost half of the Ukrainians from Yunusbayev et al. 2011 (those who show more easterly genetic characteristics). An individual of mixed Polish and Lithuanian ancestry also makes an appearance here, suggesting that one of the main factors differentiating this cluster from the Polish/West Ukrainian group is a higher level of Baltic admixture in the former.

pop21 - East Central European: This cluster is based on most of the Hungarians in my dataset, but it also includes a number of Western and Southern Slavs, often with significant German ancestry. Not surprisingly, this cluster shows very high affinity with both the East German and Polish/West Ukrainian clusters.

Let's now move on to some graphics. Below, in order of appearance, are the following: raw data coancestry matrix, showing the placement of individual samples; aggregate coancestry matrix, showing the populations (or clusters) described above; pairwise coincidence matrix, which is useful for spotting very recent ancestral ties; a PCA plot of the 21 clusters. More detailed ChromoPainter/fineSTRUCTURE PCAs of Western Europe can be found at this link.

Finally, those of you who wish to run your own experiments with the ChromoPainter datasheets from this analysis can download them here. Please note, the sheets don't reveal any raw or traits/disease data.
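As a possible starting point for such experiments, here is a minimal Python sketch of how an individual-level coancestry matrix could be collapsed into an aggregate, cluster-by-cluster view. It simply averages chunk counts over every recipient/donor pair of clusters. The function name, the toy matrix and the labels popA and popB are all made up for illustration, and fineSTRUCTURE's own aggregation may well differ in detail.

```python
from collections import defaultdict


def aggregate_coancestry(matrix, labels):
    """Average an individual-level coancestry matrix over cluster pairs.

    matrix[i][j] is taken to be the chunk count that individual i
    receives from donor j, and labels[i] is the cluster of individual i.
    Self-copying entries (i == j) are skipped.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if i == j:
                continue
            key = (labels[i], labels[j])
            sums[key] += value
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Toy data (hypothetical): four individuals in two clusters.
matrix = [
    [0, 12, 3, 2],
    [11, 0, 4, 3],
    [2, 3, 0, 15],
    [3, 2, 14, 0],
]
labels = ["popA", "popA", "popB", "popB"]

agg = aggregate_coancestry(matrix, labels)
for (recipient, donor), mean_chunks in sorted(agg.items()):
    print(recipient, donor, mean_chunks)
```

In the toy data, within-cluster averages come out much higher than between-cluster ones, which is the kind of block pattern an aggregate heat map makes visible.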

Saturday, January 14, 2012

Eurogenes' North Euro clusters - phase 1, exploring the data

I have some preliminary results from a new intra-North Euro cluster analysis, using a cutting edge tool called ChromoPainter. More than 400 samples and 270K SNPs were tested, in linkage mode, and then the output processed in fineSTRUCTURE at 200K burn-ins and iterations. Like I say, the results should be treated as preliminary, but they already look better than any other cluster analysis I've ever seen dealing with Europe north of the Alps, Pyrenees and Balkans. The algorithm identified 21 clusters, with most located in Eastern and Northeastern Europe (see spreadsheet for details). Below are two plots showing how the clusters relate to each other via a tree diagram and heat maps – the first shows an aggregate view, and the second the individual samples.

It's interesting that the Baltic Finns seem to create clusters at the drop of a hat, but they also share more chunks, and longer chunks, than any other group. Indeed, all of the Finnish clusters are closely related, and many of the individuals, especially from East Finland, even look like distant relatives on the heat map (note the ultra-hot, blue squares). On the other hand, the large Northwestern European cluster, featuring samples from across the UK, as well as from several nearby countries, is holding firm, and might be tough to break up in this analysis.

I have some theories about the reasons for the obvious genetic homogeneity and diversity in Western Europe, and these include the effects of the Black Death. It decimated many populations in the western half of the continent, thus encouraging migrations into emptied areas, and eventually leading to more open, mobile societies. It's an interesting subject, and I might write much more on it in the future. In the meantime, here's a PCA plot from the ChromoPainter chunk counts data. Note the large distances spanned by groups from Northern and Eastern Europe, and the tight bundle of samples from the west, mostly from the UK, Ireland, France and the Low Countries. Interestingly, and perhaps counter-intuitively, it's the closely related Finns who take up most of the space on the plot.

The first component picked up by this PCA appears to be an Atlantic one. It peaks in the Cornish samples, but shows similar levels in all the British, Irish, French, Dutch and Belgians (post-Black Death mobility?). If we are to assume that I identified the component correctly, then it appears as if the East Finns, Vologda Russians, Erzya from the Middle Volga, and Lithuanians are the least “Atlantic” samples in this analysis. These groups, especially the East Finns, also happen to act like relative genetic isolates in many of my experiments (such as ADMIXTURE and MDS analyses). Thus, it seems they've been sheltered from significant gene flow from outside in recent times, including from the west, like German migrations to East Central Europe, and Scandinavian influence in Western and Southwestern Finland.
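For anyone curious how a first component like this can be pulled out of chunk count data, below is a rough, standard-library-only Python sketch. It centers each column, builds the covariance matrix, and finds the leading eigenvector by power iteration. This is a generic textbook PCA, not necessarily the procedure used by ChromoPainter/fineSTRUCTURE, and the function name and the toy counts (four individuals, two donor groups) are invented for the example.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (loadings, scores) for the first principal component.

    rows is a list of equal-length lists, e.g. one row of chunk counts
    per individual. The leading eigenvector of the covariance matrix is
    found by power iteration from a deliberately non-uniform start.
    """
    n = len(rows)
    d = len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - means[k] for k in range(d)] for r in rows]
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Non-uniform start: a constant vector can be orthogonal to the
    # leading eigenvector when every row sums to the same total.
    vec = [float(k + 1) for k in range(d)]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    scores = [sum(c[k] * vec[k] for k in range(d)) for c in centered]
    return vec, scores


# Toy data (hypothetical): chunk counts from two donor groups.
counts = [
    [10, 2],
    [9, 3],
    [2, 10],
    [3, 9],
]
loadings, scores = first_principal_component(counts)
print(loadings)
print(scores)
```

The sign of a principal component is arbitrary, so only the relative positions of the samples along the axis carry meaning. In the toy data the first two individuals land on one side and the last two on the other.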

The analysis also produced a lot of detailed data showing phased half-segment matches between all individuals. In theory, it should be possible to use this information to create chromosome paintings for the people involved - much like the Ancestry Painting feature at 23andMe, but obviously with 21 potential North European reference groups, instead of 3 inter-continental ones. We shall see how that works out.
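To make the painting idea a little more concrete, here is a small Python sketch under some loud assumptions: that each phased haplotype can be described as a list of (start, end, donor cluster) segments, and that a painting summary is just the share of total length assigned to each cluster. The segment layout, the function name and the coordinates are hypothetical, and this is not the actual ChromoPainter output format.

```python
from collections import defaultdict


def painting_proportions(segments):
    """Summarise a painted haplotype as per-cluster length proportions.

    segments is a list of (start_bp, end_bp, donor_cluster) tuples
    (an assumed layout, used here purely for illustration).
    """
    lengths = defaultdict(int)
    for start, end, cluster in segments:
        lengths[cluster] += end - start
    total = sum(lengths.values())
    return {cluster: length / total for cluster, length in lengths.items()}


# Toy haplotype (hypothetical coordinates and assignments).
segments = [
    (0, 4_000_000, "East Finnish"),
    (4_000_000, 5_000_000, "Estonian"),
    (5_000_000, 10_000_000, "East Finnish"),
]
proportions = painting_proportions(segments)
print(proportions)
```

A real painting would also need to keep the segment positions for plotting, but the per-cluster proportions are a handy first check on whether the assignments look sensible.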

I'll stop rambling at this point, and attempt to break up that large Northwestern cluster (Pop21), and perhaps also the French cluster (Pop7). If they don't budge this time, perhaps they will in future runs with more samples? Indeed, I'd like to try a Eurasian-wide analysis, but might need more powerful hardware for that sort of an undertaking.

Update: Eurogenes' North Euro clusters - phase 2, final results