Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

QC for AraThal DFE #1344

Open
chriscrsmith opened this issue Aug 29, 2022 · 9 comments
Open

QC for AraThal DFE #1344

chriscrsmith opened this issue Aug 29, 2022 · 9 comments
Labels
Model QC Quality control process for model addition
Milestone

Comments

@chriscrsmith
Copy link
Contributor

PR for new model:
#1324

Original paper:
Huber et al., 2018

Parameter values:
I took the gamma parameters from Supplementary Table 4, the genome-wide, additive-only model for A. LYRATA- due to challenges with simulating with selfing for A. thaliana. Not certain this is what the group had in mind, also there might be updates on simulating selfing.

Potential issues:

  • I used the estimates from A. Lyrata.
  • I'm only 90% certain there were zero neutral mutations estimated as part of this DFE.

QC'er requests:
Kirk Lohmueller :)

@chriscrsmith chriscrsmith added the Model QC Quality control process for model addition label Aug 29, 2022
@nspope nspope added this to the Version 0.2.0 milestone Sep 20, 2022
@petrelharp
Copy link
Contributor

I'm having a look at this. I am not seeing why we should use the A. lyrata-estimated DFE, when (a) the species listed is A. thaliana, and (b) we have a DFE for A. thaliana? It's true we do need to do something about simulating selfing, but isn't that orthogonal to this?

@petrelharp
Copy link
Contributor

Hm, also - the DFE is estimated only from nonsynonymous changes. So, to complement this we need an estimate of what proportion of mutations are synonymous (and we'll put these into the DFE as neutral).

@petrelharp
Copy link
Contributor

The A. thaliana data in the paper are from Durvasula et al; but they don't report numbers of synonymous/nonsynonymous mutations either; hmph.

@petrelharp
Copy link
Contributor

Hm, furthermore, unless I'm wrong, I think we actually want the percent of new mutations that are synonymous, which will be lower than the percent of segregating mutations that are synonymous.

petrelharp added a commit to petrelharp/stdpopsim that referenced this issue May 16, 2023
@petrelharp
Copy link
Contributor

petrelharp commented May 16, 2023

Proposal: just use the proportion of possible coding changes that are synonymous, which is 330/1728:

# from https://gist.github.com/juanfal/09d7fb53bd367742127e17284b9c47bf
codontab = {
    'TCA': 'S',    # Serina
    'TCC': 'S',    # Serina
    'TCG': 'S',    # Serina
    'TCT': 'S',    # Serina
    'TTC': 'F',    # Fenilalanina
    'TTT': 'F',    # Fenilalanina
    'TTA': 'L',    # Leucina
    'TTG': 'L',    # Leucina
    'TAC': 'Y',    # Tirosina
    'TAT': 'Y',    # Tirosina
    'TAA': '*',    # Stop
    'TAG': '*',    # Stop
    'TGC': 'C',    # Cisteina
    'TGT': 'C',    # Cisteina
    'TGA': '*',    # Stop
    'TGG': 'W',    # Triptofano
    'CTA': 'L',    # Leucina
    'CTC': 'L',    # Leucina
    'CTG': 'L',    # Leucina
    'CTT': 'L',    # Leucina
    'CCA': 'P',    # Prolina
    'CCC': 'P',    # Prolina
    'CCG': 'P',    # Prolina
    'CCT': 'P',    # Prolina
    'CAC': 'H',    # Histidina
    'CAT': 'H',    # Histidina
    'CAA': 'Q',    # Glutamina
    'CAG': 'Q',    # Glutamina
    'CGA': 'R',    # Arginina
    'CGC': 'R',    # Arginina
    'CGG': 'R',    # Arginina
    'CGT': 'R',    # Arginina
    'ATA': 'I',    # Isoleucina
    'ATC': 'I',    # Isoleucina
    'ATT': 'I',    # Isoleucina
    'ATG': 'M',    # Methionina
    'ACA': 'T',    # Treonina
    'ACC': 'T',    # Treonina
    'ACG': 'T',    # Treonina
    'ACT': 'T',    # Treonina
    'AAC': 'N',    # Asparagina
    'AAT': 'N',    # Asparagina
    'AAA': 'K',    # Lisina
    'AAG': 'K',    # Lisina
    'AGC': 'S',    # Serina
    'AGT': 'S',    # Serina
    'AGA': 'R',    # Arginina
    'AGG': 'R',    # Arginina
    'GTA': 'V',    # Valina
    'GTC': 'V',    # Valina
    'GTG': 'V',    # Valina
    'GTT': 'V',    # Valina
    'GCA': 'A',    # Alanina
    'GCC': 'A',    # Alanina
    'GCG': 'A',    # Alanina
    'GCT': 'A',    # Alanina
    'GAC': 'D',    # Acido Aspartico
    'GAT': 'D',    # Acido Aspartico
    'GAA': 'E',    # Acido Glutamico
    'GAG': 'E',    # Acido Glutamico
    'GGA': 'G',    # Glicina
    'GGC': 'G',    # Glicina
    'GGG': 'G',    # Glicina
    'GGT': 'G'     # Glicina
}

syn = {}
for x in codontab:
  syn[x] = 0
  for k in range(3):
   for nuc in n:
      y = list(x)
      y[k] = nuc
      other = codontab["".join(y)]
      if other == codontab[x]:
          syn[x] += 1

# {'TCA': 6, 'TCC': 6, 'TCG': 6, 'TCT': 6, 'TTC': 4, 'TTT': 4, 'TTA': 5, 'TTG': 5, 'TAC': 4, 'TAT': 4, 'TAA': 5, 'TAG': 4, 'TGC': 4, 'TGT': 4, 'TGA': 4, 'TGG': 3, 'CTA': 7, 'CTC': 6, 'CTG': 7, 'CTT': 6, 'CCA': 6, 'CCC': 6, 'CCG': 6, 'CCT': 6, 'CAC': 4, 'CAT': 4, 'CAA': 4, 'CAG': 4, 'CGA': 7, 'CGC': 6, 'CGG': 7, 'CGT': 6, 'ATA': 5, 'ATC': 5, 'ATT': 5, 'ATG': 3, 'ACA': 6, 'ACC': 6, 'ACG': 6, 'ACT': 6, 'AAC': 4, 'AAT': 4, 'AAA': 4, 'AAG': 4, 'AGC': 4, 'AGT': 4, 'AGA': 5, 'AGG': 5, 'GTA': 6, 'GTC': 6, 'GTG': 6, 'GTT': 6, 'GCA': 6, 'GCC': 6, 'GCG': 6, 'GCT': 6, 'GAC': 4, 'GAT': 4, 'GAA': 4, 'GAG': 4, 'GGA': 6, 'GGC': 6, 'GGG': 6, 'GGT': 6}

prop_syn = sum(syn.values()) / ((3 ** 3) * (4 ** 3))
sum(syn.values()), ((3 ** 3) * (4 ** 3)), prop_syn
# (330, 1728, 0.1909722222222222)

@chriscrsmith
Copy link
Contributor Author

This is sounding good to me

@petrelharp
Copy link
Contributor

On slack, @ckyriazis says:

  1. In the Huber et al 2018 paper, I believe they actually assume the same DFE for A. thaliana and A. lyrata. So it shouldnt make a difference
  2. In that paper, they also assume a ratio of nonsynonymous to synonymous sites of 2.31:1, as estimated by Huber et al 2017

@petrelharp
Copy link
Contributor

Well, I guess we should have read the paper more carefully. Indeed:
Screenshot from 2023-05-23 08-49-22

@petrelharp
Copy link
Contributor

Conclusion from the meeting now: implement the "genome-wide both species h-s relationship", following how @ckyriazis discretized this (@klohmueller to provide a relference).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Model QC Quality control process for model addition
Projects
None yet
Development

No branches or pull requests

3 participants