Some problems in the msmodified program #2

Open
@jalhackl

Dear IntroUnet developers,

Thank you for developing this valuable deep learning tool for detecting introgression. While testing your code, we found some problems in the msmodified program.

1. The random number generator in msmodified does not work.

The command

python3 src/data/simulate_msmodified.py --model archie --odir /pine/scr/d/d/ddray/archie_sims_new --n_jobs 25

from the readme creates the same data 25 times.

Plotting the number of polymorphisms for every sample of the 25 jobs, one gets a pattern like this:
(Figure: polymorphism_pattern_msmodified)
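
The same check can be done without plotting. The following is a minimal sketch, assuming the job outputs are in the standard ms text format, where each replicate contains a `segsites:` line. The job names and the toy strings are made up for illustration, and an identical segsites profile is only a strong hint of duplicated data, not a proof.

```python
import re


def segsites_per_replicate(ms_text):
    """Return the 'segsites:' counts found in ms-format output, in order."""
    return [int(n) for n in re.findall(r"^segsites:\s*(\d+)", ms_text, flags=re.MULTILINE)]


def duplicated_jobs(outputs):
    """Group job names by segsites profile and return the groups with more than one job."""
    groups = {}
    for name, text in outputs.items():
        groups.setdefault(tuple(segsites_per_replicate(text)), []).append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Toy ms-like outputs (hypothetical); in practice these would be read from the job files.
job_a = "//\nsegsites: 3\npositions: 0.1 0.5 0.9\n010\n101\n\n//\nsegsites: 2\npositions: 0.2 0.7\n01\n10\n"
job_b = job_a
job_c = "//\nsegsites: 4\npositions: 0.1 0.2 0.3 0.4\n0101\n1010\n"
outputs = {"job_0": job_a, "job_1": job_b, "job_2": job_c}

print(duplicated_jobs(outputs))
```

With correctly seeded jobs, the returned list would be expected to be empty for any realistic number of replicates.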

The reason is that seed-related code has been added to msmodified which effectively makes it impossible to use the seed.
The main cause of the undesired behaviour is the addition of the following lines in ms.c:

introNets/msmodified/ms.c

Lines 693 to 696 in f204924

while( arg < argc ){
if (strcmp(argv[arg], "-seeds") == 0) {
break;
}

Other modifications in ms.c are likewise problematic but, fortunately, are never reached:

introNets/msmodified/ms.c

Lines 870 to 873 in f204924

case 'seeds':
for( i=0; i < 3; i++)
arg++;
break;

Furthermore, the files containing the random number generator functions do not seem to be identical to those in the current version of ms. In particular, the return type of the functions in rand1.c, one of the two implemented random number generators, was changed. This also explains the odd (undefined) behaviour mentioned in the readme:

Using the ```rand2.c``` is necessary to avoid segfaults with our simulation commands for no introgression.

(In fact, for the reasons given above, rand1.c currently works only with msmodified and rand2.c only with ms.)
A straightforward solution seems to be to revert to the original ms functions and to remove the seed-related modifications shown above from msmodified.
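
Once seed handling is restored, each job could be given its own seeds explicitly. The following is a minimal sketch, assuming that the fixed msmodified accepts `-seeds` followed by three integers, as standard ms does with rand1.c, and that each seed fits in 16 bits. The helper names and the base command are placeholders, not part of the repository.

```python
import random


def seeds_for_jobs(n_jobs, master_seed):
    """Draw one distinct triple of seeds per job, reproducibly from master_seed."""
    rng = random.Random(master_seed)
    seen = set()
    triples = []
    while len(triples) < n_jobs:
        # Assumption: seeds are stored as 16-bit unsigned values, so stay in 1..65535.
        triple = tuple(rng.randint(1, 2**16 - 1) for _ in range(3))
        if triple not in seen:
            seen.add(triple)
            triples.append(triple)
    return triples


def with_seeds(base_cmd, triple):
    """Append '-seeds s1 s2 s3' to a command given as a list of arguments."""
    return list(base_cmd) + ["-seeds"] + [str(s) for s in triple]


# Placeholder command; the real simulation command is built by the simulation script.
base_cmd = ["./msmodified/ms", "4", "1", "-t", "5.0"]
for job, triple in enumerate(seeds_for_jobs(3, master_seed=42)):
    print(job, " ".join(with_seeds(base_cmd, triple)))
```

Deriving all seeds from a single master seed keeps the whole batch reproducible while still giving every job different random numbers.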

2. msmodified leads to additional polymorphisms in the simulated data.

msmodified simulates in total 202 individuals: 100 from the introgressed population, 100 from the non-introgressed population, one from the source and one from a fourth population which splits from the source population at an arbitrary time (it seems that this population solely exists to match the structure required by msmodified).
For this reason, the genotype matrices contain more polymorphisms than they would if only 200 individuals (100 from the introgressed population and 100 from the non-introgressed population) had been sampled.
These additional polymorphisms likely influence the prediction performance and give rise to a pattern: polymorphisms which are always 0 in the introgressed and non-introgressed populations (and thus would not appear if only the individuals from these populations had been sampled) clearly indicate that a mutation occurred at these positions in the source population (or in the fourth, split population). If the mutation occurred only in the source population, it indicates that no introgression has happened at this locus.
The fourth, 'superfluous' population may dampen this effect (and in any case it likely distorts it), because a 0-column can be due to a mutation in either the source population or the fourth population. Nonetheless, a pattern is likely introduced that undermines the ghost introgression setting, because the additional polymorphisms contain information about the unknown population.
(The argument parsers of several scripts define an argument densify which is never used. Was it perhaps intended for removing the 0-columns?

parser.add_argument("--densify", action = "store_true", help = "remove singletons")

In any case, it seems that the full genotype matrices – with the additional polymorphisms – are used for the further steps.)
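
For illustration, removing the 0-columns could look like the following minimal sketch. It assumes a genotype matrix given as a list of rows (one row per sampled individual, one column per site) and that the caller knows which rows belong to the introgressed and non-introgressed populations; the function name and the toy matrix are hypothetical and not taken from the repository.

```python
def drop_columns_absent_in_targets(matrix, target_rows):
    """Drop sites where every individual in target_rows carries 0.

    Returns the column-filtered matrix (all rows kept) and the kept column indices.
    """
    n_sites = len(matrix[0]) if matrix else 0
    keep = [j for j in range(n_sites) if any(matrix[i][j] != 0 for i in target_rows)]
    filtered = [[row[j] for j in keep] for row in matrix]
    return filtered, keep


# Toy example: rows 0-1 stand for the two target populations,
# row 2 for the source and row 3 for the fourth population (assumed ordering).
matrix = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],  # mutation seen only in the source individual
    [0, 0, 0, 1],  # mutation seen only in the fourth population
]
filtered, keep = drop_columns_absent_in_targets(matrix, target_rows=[0, 1])
print(keep)
print(filtered)
```

In the toy matrix, the last two sites are polymorphic only because of the two extra individuals, so they are the ones that get removed.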

cc @kuhlwilm @xin-huang
