Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Base mod tweaks to improve validity checking. #1749

Merged
merged 4 commits into from
Feb 23, 2024

Conversation

jkbonfield
Copy link
Contributor

We may perhaps want to have a control over whether we hard error or not? If so maybe add some mitigation (nullifying) instead?

sam_mods.c Outdated
for (i = 0; i < state->nmods; i++) {
// Check if any remaining items in MM after hitting the end
// of the sequence.
if (!b->core.l_qseq)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this check be best done outside the loop?

@daviesrob
Copy link
Member

It would be good to have a few more tests for bad data here. Adding some extra mods beyond the end of the sequence to records in MM-orient.sam results in no detection on the forward strand, but a complaint for the reverse. Is that expected?

Also, while playing with this, I noticed that both pileup_mod and test_mod leak memory on failure. In the latter case it's due to a missing sam_close() in the failure clean-up path. pileup_mod is used in some tests with a non-zero return. Unfortunately it's not possible to see the leak for them in the CI tests because stderr gets redirected by the test harness, and we don't distinguish the leak-sanitizer non-zero return from the one we expect. It is possible to see when running by hand though:

$ ./test/pileup_mod test/base_mods/MM-MNf1.sam 
[E::bam_parse_basemod2] r1: MM/MN data length is incompatible with SEQ length

=================================================================
==20713==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 11536 byte(s) in 1 object(s) allocated from:
    #0 0x7f65233df7f7 in __interceptor_calloc ../../../../gcc-12.1.0/libsanitizer/asan/asan_malloc_linux.cpp:77
    #1 0x405bf4 in pileup_cd_create test/pileup_mod.c:75

SUMMARY: AddressSanitizer: 11536 byte(s) leaked in 1 allocation(s).

Also, while fiddling with this I accidentally ran the wrong program (test_mod) on the test/base_mods/MM-pileup.sam file, and got this:

==19796==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60c00000003f at pc 0x0000004902b8 bp 0x7ffc2c9086f0 sp 0x7ffc2c9086e8
READ of size 1 at 0x60c00000003f thread T0
    #0 0x4902b7 in bam_mods_at_next_pos sam_mods.c:504
    #1 0x492f39 in bam_next_basemod sam_mods.c:611
    #2 0x406488 in main ./test/test_mod.c:187
    #3 0x7f1338598c86 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21c86)
    #4 0x407079 in _start (./test/test_mod+0x407079)

0x60c00000003f is located 1 bytes to the left of 128-byte region [0x60c000000040,0x60c0000000c0)
allocated by thread T0 here:
    #0 0x7f133a609d15 in __interceptor_realloc ../../../../gcc-12.1.0/libsanitizer/asan/asan_malloc_linux.cpp:85
    #1 0x45896c in sam_realloc_bam_data sam.c:446

which suggests a more worrying problem somewhere...

@jkbonfield
Copy link
Contributor Author

I'll look at the other programs once I've finished my current PR, but in answer to the first question:

It would be good to have a few more tests for bad data here. Adding some extra mods beyond the end of the sequence to records in MM-orient.sam results in no detection on the forward strand, but a complaint for the reverse. Is that expected?

Yes. It's a best efforts thing basically. We can detect errors on the reverse strand because it means parsing all the way through the base modification string and reading it backwards. Hence we detect an overflow case right at the start. We could obviously do that full parse everywhere, but that would slow things down. The strategy I took was report errors as we find them rather than a full validation pass every time.

If you call a bam_next_basemod iterator which is what my new samtools fixmate -M code is doing then you'll trigger the error at the other end too. It's possible the API being used by the test harness is taking a different route somewhere.

When we fail with "Corrupted aux data" it's often useful to know the
BAM flags as well as the read name as this often helps disambiguate
which record for this template is the problematic read.
@jkbonfield
Copy link
Contributor Author

jkbonfield commented Feb 20, 2024

I think what I have should now be passing the memory checkers. I still need to think of more examples, but there are 2 dedicated test files already whose purposes was specifically to test MM strings referring to bases off the ends of sequences. Both test_mod and pileup_mod report errors and return with non-zero exit status, confirming the tests work.

What extra tests were you thinking of?

@daviesrob
Copy link
Member

The memory checkers are much happier.

For invalid inputs, there's test/base_mods/MM-MNf1.sam and test/base_mods/MM-MNf1.sam, but they both fail with "MM/MN data length is incompatible with SEQ length". I don't think there's anything to test the "MM tag refers to bases beyond sequence length" case you're looking for here (or the equivalent on the reverse strand).

@jkbonfield
Copy link
Contributor Author

The memory checkers are much happier.

For invalid inputs, there's test/base_mods/MM-MNf1.sam and test/base_mods/MM-MNf1.sam, but they both fail with "MM/MN data length is incompatible with SEQ length". I don't think there's anything to test the "MM tag refers to bases beyond sequence length" case you're looking for here (or the equivalent on the reverse strand).

MM being incompatible with SEQ length does mean MM refers to data beyond the end of the sequence. So it's the same thing.

Added MM overflow detection for +ve strand data.  It was already
detected on -ve strand as this is done in the initial parse loop.
We only detect +ve strand overflow once we hit the end of the
iterator, but it's still sufficient for validation.

Also improve MM parsing when faced with empty lists, such as
"C+m;C+h,0".  These parsed correctly before, but left
state->MM[0] pointing at the "C" rather than ";" which makes overflow
detection harder.
Similarly for the base mod state on failure in pileup_mod.
This frees memory when destroying earlier than expected, such as
during a processing failure.

I can't figure out how this has been missed all these years!
@jkbonfield
Copy link
Contributor Author

I added MM overflow errors too in the tests. I also moved the code performing this test, as as reported it didn't trigger on all cases. I think when I wrote it I was maybe testing a base modification right at the end of the sequence and also one beyond the end, but I can't be sure now. This does appear to work though and it still fixes things in mpileup -M.

@daviesrob daviesrob merged commit 7db7e83 into samtools:develop Feb 23, 2024
9 checks passed
jmarshall added a commit to pysam-developers/pysam that referenced this pull request May 25, 2024
In particular, remove an excess ML entry (",169") that is rejected
as an error by bam_parse_basemod2() since samtools/htslib#1749, which
appeared in HTSlib 1.20. Fixes #1291.

Also clean the generated MM-*.bam files.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants