Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When using left/right, output coordinate system is relative to start of contig #1404

Closed
nspope opened this issue Oct 17, 2022 · 6 comments · Fixed by #1570
Closed

When using left/right, output coordinate system is relative to start of contig #1404

nspope opened this issue Oct 17, 2022 · 6 comments · Fixed by #1570
Assignees
Milestone

Comments

@nspope
Copy link
Collaborator

nspope commented Oct 17, 2022

The options left, right in Species.get_contig allow a portion of a chromosome to be simulated (in contrast to simulating an entire chromosome and "masking" everything outside of [left, right]). When adding DFEs/inclusion masks/sweeps/etc to the contig, the coordinate system used is that of the original chromosome.

However, the coordinate system of the output tree sequence will be relative to the start of the contig, not of the original chromosome. So for example, if I simulate chr22 from 30 Mb to 50 Mb then dump a VCF, the coordinates of variants won't reflect their positions on the original chromosome.

Instead, should the output tree sequence span the entire chromosome and have missing data outside of [left, right]?

@nspope nspope changed the title Output coordinate system when using left, right is relative to start of contig When using left/right, output coordinate system is relative to start of contig Oct 18, 2022
@jeromekelleher
Copy link
Member

Instead, should the output tree sequence span the entire chromosome and have missing data outside of [left, right]?

Probably yes I think, it's the less bad option.

@nspope nspope added this to the Version 0.2.1 milestone Oct 18, 2022
@nspope
Copy link
Collaborator Author

nspope commented Nov 16, 2022

Should be able to do this by removing trim=True from here, and removing the coordinate "shifting" for DFEs and exclusion masks, I think.

@dschride
Copy link

We discussed this a bit in the hackathon at the meeting before prob gen and my argument was that for many uses people will be interested in generating simulated data that "look like" some region and then plugging that into some pipeline that would want coordinates starting with 1. (E.g. calculating summary stats on a bunch of sims before running ABC.)

So one option is to leave the current behavior as is, while warning the user that the rest of the chrom is not simulated, and maybe add in another flag for having the tree seq span the whole chrom and doing the missing data thing as proposed above.

@nspope
Copy link
Collaborator Author

nspope commented Mar 14, 2023

If we're outputting the original coordinate system with "flanks" outside of the simulated window set to missing, then trimming to the window (thus resetting the coordinate system) is as simple as ts_new_coord = ts.trim(). So, I'm not sure we'd need to provide an option for this in the python API, we could just mention trim method in the docstring.

We could add a --trim-missing flag to the CLI, that'd remove missing/masked flanks (this would also trim off parts of chromosomes where recombination maps are NaN, like the first 15Mb of chr22).

@nspope nspope self-assigned this Apr 23, 2023
@hyanwong
Copy link
Contributor

hyanwong commented Mar 22, 2024

I have just encountered this problem and it confused me, because I wanted to add a reference sequence to the simulated genome. I agree with @nspope and @jeromekelleher that we should not "trim by default", and we should simply have empty flanks (empty trees with tree.num_edges=0) if we specify left>0 and right < seq_len. This will also preserve the coordinate system of the simulated chromosome, in case the positions are used e.g. for annotations etc.

@petrelharp
Copy link
Contributor

In doing this, we should rename the original_coordinates attribute of Contigs (and maybe keep .original_coordinates as a deprecated alias?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants