Different results for single-sequence FASTA query than same sequence in multi-FASTA query #57

Open
jonathandmoore opened this issue Oct 18, 2023 · 3 comments

jonathandmoore commented Oct 18, 2023

I searched for a particular ORF sequence against a given database, using fasta36.

fasta36 -E 10 -f 10 -g 2 orf.fasta library.fasta 2

Then I searched for a multi-FASTA file containing many ORFs, including the original one, against the same database.

fasta36 -E 10 -f 10 -g 2 lots_of_orfs.fasta library.fasta 2

I get similar hits, but with different bit scores and E-values, which I think is driven by the different statistics. Is this a 'feature'?

Example outputs illustrating the problem. This is the result for a single ORF:

   539214 residues in  2631 sequences

Statistics:  Expectation_n fit: rho(ln(x))= 8.6726+/-0.00279; mu= -2.9805+/- 0.107
 mean_var=53.5146+/-12.132, 0's: 0 Z-trim(86.9): 21  B-trim: 21 in 1/42
 Lambda= 0.175323
 statistics sampled from 722 (728) to 722 sequences
Algorithm: FASTA (3.8 Nov 2011) [optimized]
Parameters: BL50 matrix (15:-5), open/ext: -10/-2
 ktup: 2, E-join: 1 (0.559), E-opt: 0.2 (0.277), width:  16
 Scan time:  0.030

This is the result for the same ORF as part of a list of searches:

   539214 residues in  2631 sequences

Statistics:  Expectation_n fit: rho(ln(x))= 8.7380+/-0.0029; mu= -3.1580+/- 0.110
 mean_var=64.3984+/-15.325, 0's: 0 Z-trim(91.3): 11  B-trim: 24 in 1/42
 Lambda= 0.159822
 statistics sampled from 725 (728) to 725 sequences
Algorithm: FASTA (3.8 Nov 2011) [optimized]
Parameters: BL50 matrix (15:-5), open/ext: -10/-2
 ktup: 2, E-join: 1 (0.778), E-opt: 0.2 (0.49), width:  16
 Scan time:  0.030
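
A quick check on the two statistics blocks above: in both, the reported Lambda is numerically consistent with pi / sqrt(6 * mean_var), the scale parameter of an extreme-value distribution with that variance. This is an observation from the numbers in this issue, not a statement about the FASTA source code, and the function name `lambda_from_variance` is only illustrative. If the relationship holds, the change in Lambda (and so in bit scores) follows directly from the different variance estimate in the two runs.

```python
import math


def lambda_from_variance(mean_var: float) -> float:
    """Extreme-value scale parameter implied by a variance estimate.

    Assumption: Lambda = pi / sqrt(6 * mean_var). This matches the
    numbers printed in the two outputs above.
    """
    return math.pi / math.sqrt(6.0 * mean_var)


# (mean_var, reported Lambda) copied from the two outputs in this issue.
runs = {
    "single query": (53.5146, 0.175323),
    "multi-FASTA query": (64.3984, 0.159822),
}

for label, (mean_var, reported) in runs.items():
    estimate = lambda_from_variance(mean_var)
    print(f"{label}: mean_var={mean_var} -> Lambda={estimate:.6f} (reported {reported})")
```

Both estimates agree with the reported values to about five decimal places.
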
Owner

wrpearson commented Oct 18, 2023 via email

Author

jonathandmoore commented Oct 18, 2023

Thanks for the super-quick reply. My question was about the fact that both of my searches use exactly the same library, but one has a single query sequence and the other has multiple query sequences, including that single sequence. It seems that the statistics depend not just on the library but also on the query set: the same query sequence gets different statistics depending on which other query sequences it is submitted with, even though the library does not change. Hope this is clear.

If protein A is searched against library P, it gets different scores than it gets if protein A and protein B are searched against library P.
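
To make the comparison concrete, here is a minimal sketch that pulls the fitted parameters out of the two statistics blocks quoted earlier and lists the ones that differ. The regular expressions assume the output layout shown in this issue, and the names `FIELDS`, `parse_stats`, and `diff_stats` are hypothetical helpers, not part of the FASTA package.

```python
import re

# Patterns for the fields printed in the "Statistics:" block (layout assumed
# from the outputs quoted in this issue).
FIELDS = {
    "mu": r"mu=\s*(-?[\d.]+)",
    "mean_var": r"mean_var=\s*(-?[\d.]+)",
    "Lambda": r"Lambda=\s*(-?[\d.]+)",
    "sampled": r"statistics sampled from\s+(\d+)",
}

SINGLE = """\
Statistics:  Expectation_n fit: rho(ln(x))= 8.6726+/-0.00279; mu= -2.9805+/- 0.107
 mean_var=53.5146+/-12.132, 0's: 0 Z-trim(86.9): 21  B-trim: 21 in 1/42
 Lambda= 0.175323
 statistics sampled from 722 (728) to 722 sequences
"""

MULTI = """\
Statistics:  Expectation_n fit: rho(ln(x))= 8.7380+/-0.0029; mu= -3.1580+/- 0.110
 mean_var=64.3984+/-15.325, 0's: 0 Z-trim(91.3): 11  B-trim: 24 in 1/42
 Lambda= 0.159822
 statistics sampled from 725 (728) to 725 sequences
"""


def parse_stats(text: str) -> dict:
    """Return the numeric fields found in one statistics block."""
    stats = {}
    for name, pattern in FIELDS.items():
        match = re.search(pattern, text)
        if match:
            stats[name] = float(match.group(1))
    return stats


def diff_stats(a: dict, b: dict) -> dict:
    """Map each field present in both blocks to (a, b) where the values differ."""
    return {k: (a[k], b[k]) for k in a if k in b and a[k] != b[k]}


for field, (single, multi) in diff_stats(parse_stats(SINGLE), parse_stats(MULTI)).items():
    print(f"{field}: single={single} multi={multi}")
```

All four fields differ between the two runs, including the number of sequences the statistics were sampled from (722 versus 725), even though the library size line is identical.
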

Owner

wrpearson commented Oct 21, 2023 via email
