Monday, May 8, 2017
An nMonte and 4mix guide for the participants of the Basal-rich K7 and/or Global 10 tests of the Eurogenes Project
Copied from a thread at the Anthrogenica forum because unfortunately it seems that a lot of people can't access the post:
This is an nMonte and 4mix guide I have written for people who donated to the Eurogenes Project in order to take part in the Basal-rich K7 and/or Global 10 tests of that project and subsequently received their test results. For information on how to participate in one or both of the Basal-rich K7 and Global 10 tests, see the link below:
Fund-raising offer: Basal-rich K7 and/or Global 10 genetic map
In the results you receive from Davidski by email, you are provided with your Basal-rich K7 component percentages and your position on the Basal-rich K7 PCA if you took the Basal-rich K7 test, and your Global 10 PCA coordinates and your position on the Global 10 PCA if you took the Global 10 test.
You will need your Basal-rich K7 component percentages and/or Global 10 PCA coordinates in order to make use of nMonte and 4mix, which allow you to be modeled as a mix of different populations, with varying ancestry percentages and distance levels, based on either your Basal-rich K7 or your Global 10 results.
You can download nMonte and 4mix from these links respectively:
nMonte
4Mix
Because it can run multiple targets at the same time, I gave the link to 4mix_multi rather than the classic 4mix. They are basically the same in all other respects.
In order to use nMonte and 4mix you need to have the R software installed on your PC. You can download it from one of the mirrors here:
CRAN mirrors
Making a target file for Basal-rich K7:
Open Notepad and copy and paste the Basal-rich K7 component names and your Basal-rich K7 component percentages along with your name in this format:
Basal-rich K7 spreadsheet
Global 10 datasheet
Save the input file as input. Here is an example of a Basal-rich K7 input file for nMonte:
https://www.familytreedna.com/groups/anatol-balkan-caucas/about
https://www.facebook.com/groups/800912433320422/
Sunday, March 19, 2017
I'm now taking donations for 2017. Anyone who donates $12 USD or $16 AUD, or more, will get the Basal-rich K7 ancestry proportions. Of course, you'll need to send me your genotype data for that to happen (Ancestry.com, FTDNA or 23andMe).
D(Yoruba,Iran_Neolithic)(Villabruna,AfontovaGora3) 0.0223 Z 2.812
On the other hand, the Basal-rich K7 models the early Zagros farmers as 39.05% Ancient North Eurasian and 56.67% Basal-rich (which is probably a composite of Basal Eurasian and something Villabruna-related). To me this appears to be the more sensible solution. Moreover, Lazaridis et al. 2016 characterized South Caspian forager Iran_HotuIIIb as more Basal Eurasian than the early Zagros farmers (Supplementary Information 4). The Basal-rich K7, on the other hand, shows the opposite. The D-stat below suggests that the Basal-rich K7 is closer to the truth.
D(Chimp,Ust_Ishim)(Iran_Neolithic,Iran_Hotu) 0.0156 Z 1.337
There are other such examples, and I might post them in the comments. In any case, the point I'm making is that the Basal-rich K7 is a solid piece of work and it's likely to remain relevant for a long time. Indeed, I'll be updating the Basal-rich K7 spreadsheet regularly as new ancient samples roll in, which means that you'll be able to model yourself as newly sampled ancient populations using the Basal-rich K7 ancestry proportions (for instance, see here).
The only problem with this test is that it's optimized for Eurasians. As a result, it might be sensible for anyone with significant (>5%) Sub-Saharan ancestry to skip the Basal-rich K7 and just ask for the Global 10 genetic map and coordinates. You can use the Global 10 coordinates to model your ancient and recent fine-scale ancestry, just as you would using mixture proportions. In fact, I'd say the Global 10 coordinates are more useful in this respect than any mixture test, including the Basal-rich K7.
Thanks in advance for your support. Keep in mind that the more cash I raise the busier things will be on this blog in 2017, which, by all accounts, is shaping up to be the year for ancient DNA.
Thursday, September 22, 2016
Judging by the Google search terms that are bringing traffic to this and my other blogs, a total newb to the scene is analyzing the Orcadian samples from the HGDP at GEDmatch with my K15 test. Please keep in mind that you will not see coherent results for many of the academic samples available online when using my tests. That's because I used these samples to source the allele frequencies for the tests. As a result, their ancestry proportions will often be very different from those of other samples from the same ethnic groups that were not used in this way. I call this problem the calculator effect, and it's described in my blog posts at the links below:
Wednesday, July 22, 2015
A few people are asking me about the effects of marker overlap or genotype rate on test accuracy. Logic dictates that the better the overlap, the more accurate the results, but this isn't strictly true. Here's what I've learned over the years:
- accuracy doesn't necessarily improve with higher marker overlap; it improves (up to a certain point) with more markers
- you will still see accurate results using as few as 25,000 SNPs, as long as the test doesn't suffer from any serious problems
- poorly designed tests, such as those based on fewer than 1,000 reference samples, always produce garbage results no matter what the marker overlap
In other words, a well designed test based on 200,000 SNPs will produce very accurate results for a genotype file with a marker overlap of 50%. On the other hand, another well designed test, based on just 50,000 SNPs, is likely to produce less accurate results for a genotype file with a marker overlap of 100%.
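The arithmetic behind the first point in the list can be sketched in a few lines of Python. This is only an illustration: the function name usable_markers is made up for this example, and the SNP counts and overlap rates are the ones quoted in this post.

```python
def usable_markers(test_snps: int, overlap: float) -> int:
    """Number of a test's SNPs that are actually present in a genotype file.

    test_snps -- how many SNPs the test is based on
    overlap   -- marker overlap as a fraction between 0 and 1
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be a fraction between 0 and 1")
    return round(test_snps * overlap)


# 200,000-SNP test at 50% overlap versus 50,000-SNP test at 100% overlap
big_test = usable_markers(200_000, 0.50)
small_test = usable_markers(50_000, 1.00)

print("200K test, 50% overlap:", big_test, "markers")
print("50K test, 100% overlap:", small_test, "markers")
```

The larger test still ends up with twice as many usable markers despite the lower overlap percentage, which is why overlap on its own is a poor guide to accuracy.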
So how can you tell a well designed test from a poorly designed one? It's easy, just have a look at the results they're producing for people with less complex ancestry. For instance, ask a Lithuanian, Swede or Pole what they're seeing at the top of their oracles. Is the Swede seeing Swedish or, say, German? If the answer is German instead of Swedish, or at least some type of Scandinavian, then the test is garbage and best ignored.
By the way, the recent Allentoft et al. paper on the ancient genomics of Eurasia includes a useful discussion on the effects of missing markers on the accuracy of both ADMIXTURE and PCA results. Refer to section 6.2 in the freely available supplementary info PDF here.
Tuesday, May 12, 2015
Thanks to Eurogenes project member DESEUK1. A zip file with the R script, instructions and a couple of data sheets is available here.
So let's model Poles as a bunch of ancient genomes from Central and Eastern Europe using output from my K8 analysis.
Copy & Paste: source('4mix.r')
Copy & Paste: getMix('K8avg.csv', 'target.txt', 'HungaryGamba_EN', 'HungaryGamba_HG', 'Karelia_HG', 'Corded_Ware_LN')
After a few seconds you should see the results...
Target = 19% HungaryGamba_EN + 14% HungaryGamba_HG + 2% Karelia_HG + 65% Corded_Ware_LN @ D = 0.0062
Obviously the script can use ancestry proportions and/or population averages from any test, provided they're formatted properly. The accuracy of the modeling will depend on the quality of the input.
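For anyone curious about what a fit like this involves, here's a minimal Python sketch of a four-way mixture search. It is not the 4mix R script. It assumes the fit is a brute-force search over whole-percent mixtures that minimizes the Euclidean distance between the target and the weighted source averages, and that the D in the result line is that kind of distance. The function name four_mix and the toy component values are invented for illustration.

```python
import math


def four_mix(target, sources):
    """Brute-force whole-percent search over four source populations.

    target  -- list of component proportions for the target
    sources -- list of four lists, one per source population
    Returns (percentages, distance) for the best-fitting mixture.
    """
    if len(sources) != 4:
        raise ValueError("exactly four source populations are required")
    best_mix, best_dist = None, math.inf
    for a in range(101):
        for b in range(101 - a):
            for c in range(101 - a - b):
                d = 100 - a - b - c  # the four percentages always sum to 100
                weights = (a, b, c, d)
                # Weighted average of the sources, component by component
                blend = [
                    sum(w * src[i] for w, src in zip(weights, sources)) / 100
                    for i in range(len(target))
                ]
                dist = math.sqrt(
                    sum((x - y) ** 2 for x, y in zip(blend, target))
                )
                if dist < best_dist:
                    best_mix, best_dist = weights, dist
    return best_mix, best_dist


# Toy data (hypothetical): four sources, four components each
toy_sources = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Built as 40% + 30% + 20% + 10% of the sources above
toy_target = [0.34, 0.28, 0.22, 0.16]

best_mix, best_dist = four_mix(toy_target, toy_sources)
print("best mix (%):", best_mix)
print("distance:", best_dist)
```

With four populations there are only about 177,000 whole-percent mixtures to try, so an exhaustive search like this finishes quickly. The lower the distance, the closer the modeled mixture is to the target.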
Update 19/05/2015: A new version of the 4mix script that can run multiple targets is available here, courtesy of Open Genomes.
Sunday, November 30, 2014
Monday, September 8, 2014
Update 01/01/2015: ANE is the primary cause of west to east genetic differentiation within West Eurasia.
As its name implies, the Eurogenes ANE K7 is specifically designed to estimate Ancient North Eurasian (ANE) ancestry. It's based on a series of supervised runs with the ADMIXTURE software, and freely available at GEDmatch under the Eurogenes Ad-mix tests tab.
The ANE component is not modeled on the Mal'ta boy or MA-1 genome, the main ANE proxy in scientific literature, because this sample didn't offer enough high quality markers for the job. So instead, I used the non-East Asian portions of several Karitiana genomes from the HGDP.
I wasn't sure what was going to come of that, but it actually seems to have worked out really well. Below are the results for several individuals that were not used in the making of the test, and clearly their ANE scores look pretty damn solid going by recent papers. For instance, both Lazaridis et al. and Raghavan et al. estimate the Karitiana Indians at just over 41% ANE (see here and here).
Karitiana_HGDP00998
You can also cross-check your ANE score with the results in this spreadsheet and table. The spreadsheet includes ANE estimates for more than 2,000 individuals that I tested with the ADMIXTURE software in supervised mode (see here).
On the other hand, the table comes from the Lazaridis et al. preprint, which I'm sure many of you have read by now several times over. And please pay attention to the range of ANE proportions for each population, rather than just the point estimates.
Obviously, there are also six other ancestral components in this test (hence the K7 in the name). They're basically byproducts of me trying to isolate ANE, and don't necessarily mean anything. Nevertheless, here's a brief rundown of what I think some of them might represent...
Ancestral South Eurasian (ASE): this is a really basal cluster that peaks in tribal groups of Southeast Asia. It's probably very similar in some ways to the Ancestral South Indian (ASI) component described by Reich et al. a few years ago.
Western European/Unknown Hunter-Gatherer (WHG-UHG): this essentially looks like a West Eurasian forager component, and includes the forager-like stuff carried by Neolithic farmers (Oetzi the Iceman has 40% of it).
Early Neolithic Farmer (ENF): I'd say that this is the component of the earliest Neolithic farmers from the Fertile Crescent.
The other three components should be easy to work out from their names. They're almost identical to several components with the same or similar names from my other tests.
Some of you might be wondering why this test doesn't offer an Early European Farmer (EEF) cluster. But the answer to that should be obvious by now. EEF is not a stable ancestral component. It's actually a composite of at least two ancient components, including the so-called Basal Eurasian and WHG-UHG. If it really was a genuine ancestral component, like ANE, then I'm pretty sure I'd be able to catch it with ADMIXTURE. But I can't.
Indeed, a really important thing to understand about the Lazaridis et al. study is that it doesn't actually attempt to estimate overall WHG-UHG ancestry in Europeans, but rather the excess WHG-UHG on top of what is already present in the EEF proxy Stuttgart.
Also worth noting is that this K7 can be a bit noisy. That's mainly because it's very difficult to correctly assign proportions of ancient ancestry to present-day samples. But like I say above, this test is basically designed to estimate ANE scores. If you're wanting to learn about your overall ancestry then I recommend the Eurogenes K13 and K15 tests.
Missing SNPs might also be an issue for some people. It stands to reason that results will be noisier with more missing markers and no calls.
Have fun and don't forget to make a donation at some point to the Eurogenes cause, via the PayPal tab at the top right of the page. This will help me to keep up with what's going on in the world of Paleogenomics, and continue blogging and running analyses.
Iosif Lazaridis, Nick Patterson, Alissa Mittnik, et al., Ancient human genomes suggest three ancestral populations for present-day Europeans, arXiv, April 2, 2014, arXiv:1312.6639v2
Raghavan et al., Upper Palaeolithic Siberian genome reveals dual ancestry of Native Americans, Nature, (2013), Published online 20 November 2013, doi:10.1038/nature12736
Corded Ware Culture linked to the spread of ANE across Europe