Documentation: training data to create a model #14

Closed
RunningMatcha opened this issue Jun 22, 2024 · 3 comments

Comments

@RunningMatcha

Hi @aquaskyline,

Thanks for developing such a great tool! I am currently testing your variant caller on ONT data from different organisms. It recognizes SNPs nicely, but I am having trouble with short deletions.
I was wondering whether it makes sense for me to train it on my own data to create a model. The Clair3 documentation explains how to train a model, but I did not find equivalent documentation for ClairS-TO.
If this is possible, could you please add to the documentation how users can create their own models?

Thank you very much!

@aquaskyline
Member

Preparing training data for ClairS and ClairS-TO is much more complicated than for Clair3 because it uses synthetic data, and retraining seldom improves calling performance because of the scale and heterogeneity needed in the training data. On the other hand, you mentioned missing short deletions. Would it be possible for you to send us some IGV screenshots of the missing short deletions so we can see whether there is a solution?

@RunningMatcha
Author

Hi aquaskyline.

Thank you for your quick reply!
I was perhaps too naive and thought that, since we work with synthetic DNA, the accuracy would improve if we trained ClairS-TO on our own sequences (not always derived from human DNA). Have you tested ClairS-TO on different organisms?
After your reply, I applied stricter filtering parameters and selected the proper model (by mistake, I had given ClairS-TO the "sup" model even though my data was basecalled in "hac" mode). Now I can see the expected deletions ;)

By the way, which filtering parameters do you recommend for fastq pre-processing?

Here is an example of a region with a known deletion, with relaxed filtering parameters (q > 10):
[screenshot: deletion]

Here is the same sample, but with q > 13. I lost a lot of reads, though.
[screenshot]

@aquaskyline
Member

Great that you rediscovered the deletions. In terms of fastq pre-processing parameters, the diversity of tumors, sequencing setups, and ONT data itself means there is no single perfect preset. So it is worth some tuning effort if you have a big batch of samples. But if you are working on only a few samples, the defaults usually work pretty well already.
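
For anyone who wants to see what such tuning trades off, here is a minimal Python sketch of a per-read quality filter like the q > 10 and q > 13 cutoffs mentioned above. It is only an illustration under assumptions: the thread does not say which filtering tool was used, so this assumes the threshold applies to the mean read quality, computed by averaging per-base error probabilities and converting back to the Phred scale, with Phred+33 encoding. The function names (`mean_read_quality`, `filter_reads`, `retained_fraction`) are hypothetical and not part of ClairS-TO.

```python
import math


def mean_read_quality(quality_string, phred_offset=33):
    """Mean Phred quality of one read, averaged in error-probability space.

    Assumption: Phred+33 encoded quality string, as in standard FASTQ.
    """
    if not quality_string:
        return 0.0
    # Convert each quality character to its error probability.
    error_probs = [10 ** (-(ord(ch) - phred_offset) / 10) for ch in quality_string]
    mean_error = sum(error_probs) / len(error_probs)
    # Convert the mean error probability back to the Phred scale.
    return -10 * math.log10(mean_error)


def filter_reads(records, min_quality):
    """Keep (name, sequence, quality) records with mean quality above min_quality."""
    return [rec for rec in records if mean_read_quality(rec[2]) > min_quality]


def retained_fraction(records, min_quality):
    """Fraction of reads that survive a given quality threshold."""
    if not records:
        return 0.0
    return len(filter_reads(records, min_quality)) / len(records)


if __name__ == "__main__":
    # Toy reads: '(' is Q7, '-' is Q12, '5' is Q20 in Phred+33.
    reads = [
        ("read_low", "ACGT", "(((("),
        ("read_mid", "ACGT", "----"),
        ("read_high", "ACGT", "5555"),
    ]
    for threshold in (10, 13):
        kept = retained_fraction(reads, threshold)
        print(f"q > {threshold}: {kept:.0%} of reads retained")
```

Running something like `retained_fraction` over a batch of samples at a few candidate thresholds gives a quick view of how many reads each cutoff costs before re-running the variant caller.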
