Grapheme.split function crashes #19

Open
Hasnep opened this issue Aug 26, 2024 · 0 comments
Hasnep commented Aug 26, 2024

The Grapheme.split function crashes on some edge cases. For example, running:

Grapheme.split (Str.fromUtf8 [225, 134, 168, 226, 128, 141, 225, 133, 129])

Crashes with the output:

The program crashed with:

        This is definitely a bug in the roc-lang/unicode package, caused by an unhandled edge case in grapheme text segmentation.

It is difficult to track down and catch every possible combination, so it would be helpful if you could log this as an issue with a reproduction.

Grapheme.split state machine state at the time was:
((AfterZWJ <opaque>), [8205, 4417], [ZWJ, L])

Here is the call stack that led to the crash:

        roc.panic
        Grapheme.splitHelp
        Grapheme.(anonymous function)
        Result.try
        Grapheme.split
        app.(anonymous function)
        Task.(anonymous function)
        .(anonymous function)
        rust.main

Optimizations can make this list inaccurate! If it looks wrong, try running without `--optimize` and with `--linker=legacy`

Here is a list of examples that crash this function:

Grapheme.split (Str.fromUtf8 [13, 204, 136, 225, 134, 168, 226, 128, 141, 234, 176, 129])
Grapheme.split (Str.fromUtf8 [224, 185, 131, 1, 225, 133, 160, 226, 128, 141, 224, 164, 128])
Grapheme.split (Str.fromUtf8 [225, 132, 128, 226, 128, 141, 204, 136, 224, 165, 141])
Grapheme.split (Str.fromUtf8 [225, 132, 128, 226, 128, 141, 204, 136, 31])
Grapheme.split (Str.fromUtf8 [225, 133, 160, 226, 128, 141, 204, 136, 205, 184])
Grapheme.split (Str.fromUtf8 [225, 133, 160, 226, 128, 141, 224, 164, 149])
Grapheme.split (Str.fromUtf8 [225, 134, 168, 226, 128, 141, 10])
Grapheme.split (Str.fromUtf8 [225, 134, 168, 226, 128, 141, 225, 133, 129])
Grapheme.split (Str.fromUtf8 [234, 176, 128, 226, 128, 141, 224, 165, 141])
Grapheme.split (Str.fromUtf8 [234, 176, 128, 226, 128, 141, 224, 181, 142])
Grapheme.split (Str.fromUtf8 [234, 176, 129, 226, 128, 141, 204, 136, 240, 159, 135, 166])
Grapheme.split (Str.fromUtf8 [234, 176, 129, 226, 128, 141, 225, 134, 168])
Grapheme.split (Str.fromUtf8 [234, 176, 129, 226, 128, 141, 36])
Grapheme.split (Str.fromUtf8 [243, 160, 129, 174, 234, 176, 128, 226, 128, 141, 224, 164, 188])

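They all contain U+200D, the zero-width joiner (ZWJ) character, encoded in UTF-8 as the bytes 226, 128, 141. The state machine dump above also reports AfterZWJ, so handling of the ZWJ is probably the source of the crash.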
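The ZWJ observation can be checked mechanically. The following is a minimal Python sketch, independent of the roc-lang/unicode package, that decodes each byte list from this issue into code points and looks at what sits directly before the ZWJ. The helper names (`code_points`, `is_hangul`, `before_zwj`) are made up for this illustration, and the Hangul ranges are the Hangul Jamo block and the precomposed Hangul Syllables range.

```python
# Byte lists copied from the crashing examples listed in this issue.
CRASHING_INPUTS = [
    [13, 204, 136, 225, 134, 168, 226, 128, 141, 234, 176, 129],
    [224, 185, 131, 1, 225, 133, 160, 226, 128, 141, 224, 164, 128],
    [225, 132, 128, 226, 128, 141, 204, 136, 224, 165, 141],
    [225, 132, 128, 226, 128, 141, 204, 136, 31],
    [225, 133, 160, 226, 128, 141, 204, 136, 205, 184],
    [225, 133, 160, 226, 128, 141, 224, 164, 149],
    [225, 134, 168, 226, 128, 141, 10],
    [225, 134, 168, 226, 128, 141, 225, 133, 129],
    [234, 176, 128, 226, 128, 141, 224, 165, 141],
    [234, 176, 128, 226, 128, 141, 224, 181, 142],
    [234, 176, 129, 226, 128, 141, 204, 136, 240, 159, 135, 166],
    [234, 176, 129, 226, 128, 141, 225, 134, 168],
    [234, 176, 129, 226, 128, 141, 36],
    [243, 160, 129, 174, 234, 176, 128, 226, 128, 141, 224, 164, 188],
]

ZWJ = 0x200D  # ZERO WIDTH JOINER; its UTF-8 encoding is 226, 128, 141


def code_points(utf8_bytes):
    """Decode a list of UTF-8 byte values into a list of code points."""
    return [ord(ch) for ch in bytes(utf8_bytes).decode("utf-8")]


def is_hangul(cp):
    """True for the Hangul Jamo block or a precomposed Hangul syllable."""
    return 0x1100 <= cp <= 0x11FF or 0xAC00 <= cp <= 0xD7A3


def before_zwj(cps):
    """Return the code point directly before the first ZWJ, or None."""
    i = cps.index(ZWJ)  # raises ValueError if there is no ZWJ at all
    return cps[i - 1] if i > 0 else None


if __name__ == "__main__":
    for raw in CRASHING_INPUTS:
        cps = code_points(raw)
        shown = " ".join(f"U+{cp:04X}" for cp in cps)
        print(f"{shown} | before ZWJ: U+{before_zwj(cps):04X}")
```

On these 14 inputs, every ZWJ is directly preceded by a Hangul code point (a jamo or a precomposed syllable). That matches the `[ZWJ, L]` state in the crash output, but it is only an observation about this data, not a diagnosis of the bug.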

These examples were found by running the radamsa fuzzer on the examples in the GraphemeBreakTest data file. Hopefully this fuzz testing can be automated in the future, as mentioned in #7.
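For anyone wanting to reproduce the seed corpus, here is a rough Python sketch of turning GraphemeBreakTest.txt lines into UTF-8 byte strings that a fuzzer could mutate. It is not the script used for this issue. The function names and the file path passed to `load_seeds` are assumptions, and the line format assumed is the usual one for that file: hex code points separated by ÷ (break) and × (no break) markers, followed by a `#` comment.

```python
# Markers used in GraphemeBreakTest.txt: U+00F7 (break) and U+00D7 (no break).
BREAK_MARKERS = {"\u00f7", "\u00d7"}


def seed_bytes(line):
    """Turn one test-file line into the UTF-8 bytes of its code points.

    Returns None for blank lines, comment-only lines, and lines that
    contain surrogate code points, which cannot be encoded as UTF-8.
    """
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None
    cps = [int(tok, 16) for tok in data.split() if tok not in BREAK_MARKERS]
    if any(0xD800 <= cp <= 0xDFFF for cp in cps):
        return None
    return "".join(chr(cp) for cp in cps).encode("utf-8")


def load_seeds(path):
    """Read a test file (hypothetical path) and return all usable seeds."""
    with open(path, encoding="utf-8") as f:
        seeds = (seed_bytes(line) for line in f)
        return [s for s in seeds if s is not None]
```

Each returned seed could then be written to its own file and handed to a mutation fuzzer. Printing `list(seed)` gives a byte list in the same form as the reproductions above.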
