Saturday, May 26, 2012

Beware the "calculator effect"

Many people are getting skewed results from so-called DIY admixture calculators. For instance, users from the UK often come out much more continental European than they should. Some of them actually believe that this is because they're genetically more Norman or Saxon than the average Brit.

No, the real reason is what I call the "calculator effect". This is when the algorithm gives different results to people who were part of the ADMIXTURE runs that produced the allele frequencies used by the calculators than to people who weren't, even though both sets of users are of exactly the same origin and should expect basically identical results.

So, is it possible to get around this calculator effect? Yes. People who aren't included in the datasets that produce the allele frequencies used by the calculators shouldn't compare their results to those of people who are, including the academic references. They should only compare their results to those of other calculator users. On the other hand, members of the various projects who are run as references should only compare their results to those of other project members and relevant academic references.

I've put together a quick experiment to show the "calculator effect" in full force. I ran two intra-North European ADMIXTURE analyses at K=3, Test1 and Test2, and included myself (PL1) only in the former. Apart from my absence from the second run, the two tests were almost identical. I then tested my genome with calculators built from the allele frequencies produced by each run.
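For readers who haven't looked inside one of these calculators, here is a rough Python sketch of the general idea: the allele frequencies from an earlier run are held fixed, and only the individual's ancestry proportions are fitted. This is not the code of any actual calculator or of ADMIXTURE itself. It is a plain EM update that I'm assuming is close enough in spirit, and the function name `estimate_proportions`, the toy frequencies, and the simulated genotypes are all made up for illustration.

```python
import random


def estimate_proportions(genotypes, freqs, iterations=200, eps=1e-6):
    """Fit one individual's ancestry proportions with allele frequencies held fixed.

    genotypes: per-SNP counts (0, 1 or 2) of the counted allele.
    freqs:     per-SNP lists; freqs[j][k] is that allele's frequency in component k.
    Hypothetical helper for illustration only.
    """
    k = len(freqs[0])
    # Keep frequencies away from 0 and 1 so no denominator can be zero.
    clamped = [[min(max(x, eps), 1.0 - eps) for x in row] for row in freqs]
    q = [1.0 / k] * k  # start from equal proportions
    for _ in range(iterations):
        totals = [0.0] * k
        for g, p in zip(genotypes, clamped):
            carry = sum(q[m] * p[m] for m in range(k))          # P(counted allele)
            other = sum(q[m] * (1.0 - p[m]) for m in range(k))  # P(other allele)
            for m in range(k):
                # Expected number of this person's two alleles at the SNP
                # that trace back to component m.
                totals[m] += g * q[m] * p[m] / carry
                totals[m] += (2 - g) * q[m] * (1.0 - p[m]) / other
        q = [t / (2.0 * len(genotypes)) for t in totals]
    return q


# Toy data: 2000 simulated SNPs and K=3 components (all values are invented).
rng = random.Random(0)
true_q = [0.5, 0.3, 0.2]
freqs = [[rng.uniform(0.05, 0.95) for _ in true_q] for _ in range(2000)]
genotypes = []
for row in freqs:
    h = sum(qk * pk for qk, pk in zip(true_q, row))  # chance of the counted allele
    genotypes.append(sum(rng.random() < h for _ in range(2)))

estimate = estimate_proportions(genotypes, freqs)
print([round(x, 3) for x in estimate])
```

The point to notice is that the frequencies are the only thing the calculator inherits from the original run. If a sample helped shape those frequencies, it is being scored against numbers partly fitted to itself, while an outside sample of the same origin is not.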

My calculator results for Test1 were very similar to the results I received from ADMIXTURE, and made perfect sense based on my ancestry. However, the calculator results for Test2 were way off, and basically made me look like a different sample from some other part of Europe. I even managed to score Far Eastern ancestry above noise level in the calculator version of Test2. Please note, however, that all the other individuals received almost identical scores in both tests. The results from the experiment can be seen in the spreadsheet below.

Calculator Effect K=3

I have to say I'm disappointed that no one else is talking about the calculator effect, and how to remedy it. I actually designed my Eurogenes ancestry tests for Gedmatch with this problem in mind, by only using academic references to source the allele frequencies. This means that test results for Eurogenes project members and non-members are directly comparable. Perhaps other genome bloggers can eventually do the same?

