Documentation: training data to create a model #14

RunningMatcha · 2024-06-22T06:07:16Z

Thanks for developing such a great tool! I am currently testing your variant caller on ONT data from different organisms. It recognizes SNPs nicely, but I am having trouble with short deletions.
I was wondering whether it makes sense to train it on my own data to create a model. The Clair3 documentation explains how to train a model, but I could not find the equivalent documentation for ClairS-TO.
If this is possible, would you please add to the documentation how users can create their own models?

Thank you very much!

aquaskyline · 2024-06-22T07:17:57Z

Preparing training data for ClairS and ClairS-TO is much more complicated than for Clair3 because it uses synthetic data, and retraining seldom improves calling performance because of the scale and heterogeneity needed in the training data. On the other hand, you mentioned missing short deletions. Would it be possible for you to send us some IGV screenshots of the missing short deletions so we can see if there is a solution?

RunningMatcha · 2024-06-24T10:56:49Z

Hi aquaskyline.

Thank you for your quick reply!
I was perhaps too naive and thought that, if we work with synthetic DNA, accuracy would improve if we trained ClairS-TO on our own sequences (which are not always derived from human DNA). Have you tested ClairS-TO on different organisms?
After your reply, I applied stricter filtering parameters and selected the proper model (by mistake, I had given ClairS-TO the "sup" model even though my data was basecalled in "hac" mode). Now I can see the expected deletions ;)

By the way, which filtering parameters do you recommend for fastq pre-processing?

Here is an example of a region with a known deletion, using relaxed filtering parameters (q > 10).

Here is the same sample with q > 13. I lost a lot of reads, though.

aquaskyline · 2024-06-24T11:12:46Z

Great that you rediscovered the deletions. In terms of fastq pre-processing parameters, the diversity of tumors, sequencing setups, and ONT data itself means there is no single perfect preset. So it is worth some tuning effort on a big batch of samples. But if one is working on only a few samples, the defaults usually work pretty well already.
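
For readers who want to gauge the read loss discussed above before committing to a threshold, here is a minimal sketch that reports the fraction of reads surviving several quality cutoffs. It rests on assumptions the thread does not confirm: that "q" means the mean per-read Phred quality, that the mean is taken in error-probability space, and that the FASTQ uses four-line records with a Phred+33 encoding. The function names are hypothetical and are not part of ClairS-TO or of any filtering tool.

```python
import math


def mean_read_quality(quality_string, phred_offset=33):
    """Mean Phred quality of one read, averaged in error-probability space.

    Assumption: Phred+33 encoded quality string.
    """
    if not quality_string:
        return 0.0
    probs = [10 ** (-(ord(c) - phred_offset) / 10) for c in quality_string]
    mean_prob = sum(probs) / len(probs)
    return -10 * math.log10(mean_prob)


def parse_fastq(lines):
    """Yield (header, sequence, quality) from an iterable of FASTQ lines.

    Assumption: strict four-line records (no wrapped sequences).
    """
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        seq = next(it).rstrip("\n")
        next(it)  # '+' separator line
        qual = next(it).rstrip("\n")
        yield header, seq, qual


def surviving_fraction(records, thresholds):
    """Fraction of reads whose mean quality is strictly above each threshold."""
    qualities = [mean_read_quality(q) for _, _, q in records]
    total = len(qualities)
    return {
        t: (sum(q > t for q in qualities) / total if total else 0.0)
        for t in thresholds
    }


if __name__ == "__main__":
    # Toy reads with uniform per-base quality: '(' is Q7, '-' is Q12, '5' is Q20.
    toy = [
        "@read1", "ACGT", "+", "((((",
        "@read2", "ACGT", "+", "----",
        "@read3", "ACGT", "+", "5555",
    ]
    for threshold, frac in surviving_fraction(parse_fastq(toy), [10, 13]).items():
        print(f"q > {threshold}: {frac:.0%} of reads kept")
```

To run it on real data, pass an open file handle (for example `open("reads.fastq")`) to `parse_fastq` instead of the toy list. Comparing the surviving fractions across a batch of samples is one way to do the tuning suggested above; the thresholds that work best will still depend on the data.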

aquaskyline closed this as completed Jul 20, 2024
