Different results for single-sequence FASTA query than same sequence in multi-FASTA query #57

jonathandmoore · 2023-10-18T15:06:46Z

I searched for a particular ORF sequence against a given database, using ssearch36.

fasta36 -E 10 -f 10 -g 2 orf.fasta library.fasta 2

Then I searched for a multi-FASTA file containing many ORFs, including the original one, against the same database.

fasta36 -E 10 -f 10 -g 2 lots_of_orfs.fasta library.fasta 2

I get similar hits, but the hits have different bit scores and evalues, I think driven by the different statistics. Is this a 'feature'?

Example outputs illustrating the problem, this result for a single ORF:

   539214 residues in  2631 sequences

Statistics:  Expectation_n fit: rho(ln(x))= 8.6726+/-0.00279; mu= -2.9805+/- 0.107
 mean_var=53.5146+/-12.132, 0's: 0 Z-trim(86.9): 21  B-trim: 21 in 1/42
 Lambda= 0.175323
 statistics sampled from 722 (728) to 722 sequences
Algorithm: FASTA (3.8 Nov 2011) [optimized]
Parameters: BL50 matrix (15:-5), open/ext: -10/-2
 ktup: 2, E-join: 1 (0.559), E-opt: 0.2 (0.277), width:  16
 Scan time:  0.030

This result for the same ORF as part of a list of searches:

   539214 residues in  2631 sequences

Statistics:  Expectation_n fit: rho(ln(x))= 8.7380+/-0.0029; mu= -3.1580+/- 0.110
 mean_var=64.3984+/-15.325, 0's: 0 Z-trim(91.3): 11  B-trim: 24 in 1/42
 Lambda= 0.159822
 statistics sampled from 725 (728) to 725 sequences
Algorithm: FASTA (3.8 Nov 2011) [optimized]
Parameters: BL50 matrix (15:-5), open/ext: -10/-2
 ktup: 2, E-join: 1 (0.778), E-opt: 0.2 (0.49), width:  16
 Scan time:  0.030
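
To make the differences between the two "Statistics:" blocks easier to see, the fitted parameters can be pulled out and compared side by side. The following is a minimal sketch that assumes the output layout shown above; `parse_stats` and the regular expressions are illustrative helpers, not part of the FASTA package.

```python
import re

# The two "Statistics:" blocks, copied from the outputs above.
SINGLE = """Statistics:  Expectation_n fit: rho(ln(x))= 8.6726+/-0.00279; mu= -2.9805+/- 0.107
 mean_var=53.5146+/-12.132, 0's: 0 Z-trim(86.9): 21  B-trim: 21 in 1/42
 Lambda= 0.175323"""

MULTI = """Statistics:  Expectation_n fit: rho(ln(x))= 8.7380+/-0.0029; mu= -3.1580+/- 0.110
 mean_var=64.3984+/-15.325, 0's: 0 Z-trim(91.3): 11  B-trim: 24 in 1/42
 Lambda= 0.159822"""

# Hypothetical patterns: each captures the number that follows the label.
PATTERNS = {
    "rho": r"rho\(ln\(x\)\)=\s*([-\d.]+)",
    "mu": r"mu=\s*([-\d.]+)",
    "mean_var": r"mean_var=\s*([-\d.]+)",
    "Lambda": r"Lambda=\s*([-\d.]+)",
}


def parse_stats(block: str) -> dict:
    """Return the fitted parameters found in one Statistics block."""
    stats = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, block)
        if match:
            stats[name] = float(match.group(1))
    return stats


single, multi = parse_stats(SINGLE), parse_stats(MULTI)
for name in PATTERNS:
    diff = multi[name] - single[name]
    print(f"{name:9s} single={single[name]:<10} multi={multi[name]:<10} diff={diff:+.4f}")
```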


wrpearson · 2023-10-18T16:41:21Z

The statistical estimates provided by the FASTA programs (including SSEARCH) are determined empirically, by sampling up to 60,000 scores from the library searched. Since your library only has 2631 sequences, the estimates are based on all the scores that were calculated. But since your library has different sequences, the distribution of scores is slightly different, and thus the statistical estimates are slightly different. If you look at the numbers after the "Statistics:" line, you see that the rho, mu, mean_var, and Lambda are all slightly different, reflecting the different sets of scores that were found in the two searches. These parameters were determined by fitting the scores that were obtained, and are used to calculate the E()-value and bit score. I do think of it as a "feature", since the estimates reflect the properties of the database that was searched.

Bill Pearson
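
The Lambda values printed in the two outputs appear to follow from mean_var through the usual extreme-value relation, variance = pi^2 / (6 * lambda^2). This is an observation drawn from the numbers in this thread, not a statement about how the FASTA code computes Lambda internally. The sketch below checks the relation against both reported pairs; `lambda_from_variance` is a hypothetical helper name.

```python
import math


def lambda_from_variance(mean_var: float) -> float:
    """Scale parameter of an extreme-value distribution with this variance.

    Uses variance = pi^2 / (6 * lambda^2), solved for lambda.
    """
    return math.pi / math.sqrt(6.0 * mean_var)


# (mean_var, Lambda) pairs copied from the two outputs in this issue.
reported = {
    "single query": (53.5146, 0.175323),
    "multi query": (64.3984, 0.159822),
}

for label, (mean_var, lam) in reported.items():
    computed = lambda_from_variance(mean_var)
    print(f"{label}: computed {computed:.6f}, reported {lam}")
```

If the relation holds, the change in Lambda between the two runs is simply a consequence of the change in the fitted variance of the scores.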

jonathandmoore · 2023-10-18T17:02:00Z

Thanks for the super-quick reply. My doubt was that both of my searches use exactly the same library, but one has a single query sequence and the other has multiple query sequences including the single sequence. It seems that the statistics are not just dependent on the library, but on the query sequences: the same query sequence gets different statistics depending on which other query sequences it is submitted with, even though the library does not change. Hope this is clear.

If protein A is searched against library P, it gets different scores than it gets if protein A and protein B are searched against library P.

wrpearson · 2023-10-21T19:27:44Z

The statistics do depend on both the query and the library sequences, but searching the same library with a single query sequence, or with that query sequence included in a library with other separate sequences, should produce the same results (subject to sampling the library, which only takes place with more than 60,000 sequences). If a query sequence is embedded in another sequence, then the statistics will be different. But if one search has 1 query A, and another search has 10 queries including A, then the results for A should be the same. The statistics for each query in a multi-query search are calculated independently.

However, I just did a test and learned that I am mistaken -- I get slightly different results when the query sequence is part of a multi-query library. I will look into it.

Bill Pearson
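
To repeat this kind of test, the single query should be exactly the same record that appears in the multi-query file, so that any difference in the statistics cannot come from the sequence itself. A minimal sketch for pulling one record out of a multi-FASTA file follows; `read_fasta`, `extract_record`, the sequence IDs, and the toy sequences are all made up for illustration.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def extract_record(text, seq_id):
    """Return the record whose first header word equals seq_id."""
    for header, seq in read_fasta(text):
        if header.split()[:1] == [seq_id]:
            return f">{header}\n{seq}\n"
    raise KeyError(seq_id)


# Toy multi-FASTA content standing in for lots_of_orfs.fasta (invented data).
MULTI_FASTA = """>orfA example
MKTAY
IAK
>orfB example
MSLLT
"""

# In practice this text would be written to the single-query file.
single_query = extract_record(MULTI_FASTA, "orfA")
print(single_query, end="")
```

The extracted record can then be searched on its own and as part of the original file, using the same command line for both runs.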
