Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLI returning haploid samples, API diploid #1451

Open
andrewkern opened this issue Mar 9, 2023 · 6 comments
Open

CLI returning haploid samples, API diploid #1451

andrewkern opened this issue Mar 9, 2023 · 6 comments
Assignees
Labels
bug Something isn't working
Milestone

Comments

@andrewkern
Copy link
Member

In comparing output from the API and the CLI during the probgen workshop I noticed a discrepancy in the ploidy output through a conversion to vcf. I think the issue is on the CLI side.

Here's an example. First using the CLI

$ stdpopsim AraTha 10 \
--demographic-model SouthMiddleAtlas_1D17 \
--chromosome chr1 \
--length-multiplier 0.1 \
--output aratha.ts

$ tskit vcf aratha.ts > aratha.CLI.vcf
$ head aratha.CLI.vcf
##fileformat=VCFv4.2
##source=tskit 0.5.4
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=3042767>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	tsk_0	tsk_1	tsk_2	tsk_3	tsk_4	tsk_5	tsk_6tsk_7	tsk_8	tsk_9
1	25	0	C	A	.	PASS	.	GT	0	1	0	0	0	0	0	00	0
1	169	1	G	C	.	PASS	.	GT	1	0	1	1	1	1	1	10	0
1	300	2	T	C	.	PASS	.	GT	0	0	1	0	0	0	1	00	0
1	613	3	A	C	.	PASS	.	GT	1	0	1	1	1	1	1	11	0

notice the ploidy above

I can compare this to output from the API

species = stdpopsim.get_species("AraTha") 
model = species.get_demographic_model('SouthMiddleAtlas_1D17')
samples = {'SouthMiddleAtlas':10}
engine = stdpopsim.get_engine('msprime')
contig = species.get_contig("1", length_multiplier=0.1) #fill me!
ts = engine.simulate(model, contig, samples)
f = open("aratha.API.vcf","w")
ts.write_vcf(f)

then let's look at the vcf output from the API

$ head aratha.API.vcf
##fileformat=VCFv4.2
##source=tskit 0.5.4
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=3042767>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	tsk_0	tsk_1	tsk_2	tsk_3	tsk_4	tsk_5	tsk_6tsk_7	tsk_8	tsk_9
1	35	0	A	G	.	PASS	.	GT	0|1	0|0	0|0	0|1	0|0	0|0	0|0	0|0	0|0	0|0
1	152	1	C	T	.	PASS	.	GT	1|0	0|1	0|0	1|0	1|1	1|0	1|1	1|0	1|0	0|1
1	263	2	T	A	.	PASS	.	GT	1|0	0|1	0|0	1|0	1|1	1|0	1|1	1|0	1|0	0|1
1	276	3	A	G	.	PASS	.	GT	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|1	0|0	0|0

diploid, as I would have expected.

I'm guessing this has to do with the assignment of ploidy through the CLI. Could it be that contigs have the wrong ploidy here?

@andrewkern andrewkern added the bug Something isn't working label Mar 9, 2023
@andrewkern
Copy link
Member Author

a little bit more info here.. Doing a dry run from the API indicates that contig ploidy is not the issue:

Contig Description:
    Contig origin: chr1:0-3042767
    Contig length: 3042767.0
    Contig ploidy: 2
    Mean recombination rate: 8.064516129032259e-10
    Mean mutation rate: 7.1e-09
    Genetic map: None

@jeromekelleher
Copy link
Member

Can you give us the output of tskit info aratha.ts please @andrewkern? That should tell us whether the individuals are present or not (which is what tskit uses to detect ploidy)

@nspope
Copy link
Collaborator

nspope commented Mar 11, 2023

I see what's happening here: @andrewkern, you're using the deprecated API for sample specification, which is haploid for back compatibility. If instead, you run:

stdpopsim AraTha SouthMiddleAtlas:10 \
--demographic-model SouthMiddleAtlas_1D17 \
--chromosome chr1 \
--length-multiplier 0.1 \
--output aratha.ts

tskit vcf aratha.ts > aratha.CLI.vcf
head aratha.CLI.vcf

##fileformat=VCFv4.2
##source=tskit 0.5.5.dev0
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=3042767>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  tsk_0  tsk_1    tsk_2   tsk_3   tsk_4   tsk_5   tsk_6   tsk_7   tsk_8   tsk_9
1       122     0       C       T       .       PASS    .       GT      1|0    1|0      0|0     1|0     0|0     0|1     0|0     0|1     0|0     0|0

@nspope
Copy link
Collaborator

nspope commented Mar 11, 2023

We may want to print a warning regarding ploidy, if the deprecated API is used? Currently, the warning that is printed is:

:(base) natep@poppy:~$ stdpopsim AraTha 10 --demographic-model SouthMiddleAtlas_1D17 --chromosome chr1 --length-multiplier 0.1 --output aratha.ts
WARNING: The use of `DemographicModel.get_samples` (Python API) and positional sample counts (CLI) is deprecated. Instead, supply a {population_name:num_samples} dict to `Engine.simulate(samples=...)` (Python API); or use the syntax `stdpopsim SpeciesName population_name:num_samples` (CLI)

@andrewkern
Copy link
Member Author

nice catch @nspope. beyond updating the warning, we're going to have to update the workshop notebooks (i'll volunteer given that I just ran a workshop), and probably should change the documentation. currently we say this about the samples positional arg to the CLI
image

@nspope nspope self-assigned this Mar 22, 2023
@nspope nspope added this to the Version 0.2.1 milestone Mar 22, 2023
@petrelharp
Copy link
Contributor

Andy updated the notebooks already; it looks like the warning needs updating still.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants