
BUG: Multicharacter support #466

Open
boyter opened this issue May 20, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@boyter
Owner

boyter commented May 20, 2024

Describe the bug

The addition of the wenyan language via #465 highlights an issue in how scc matches tokens. If you run it against any wenyan file, it will not count the complexity. This comes down to how the scc state machine works: it matches on characters, not runes.

To Reproduce

go run . ./examples/language/wenyan.wy

from inside the scc directory on a recent checkout.

Expected behavior

It is expected that there will be some complexity counted here.

boyter added the bug label May 20, 2024
@dbaggerman
Collaborator

It's been a while since I touched the code, but from what I remember the matching was done on bytes rather than characters. It should be able to match Unicode byte sequences just as well as ASCII ones.

However, that relies on both languages.json and the source code using the same encoding. If they're both UTF-8 encoded with the same byte sequence, then in theory it should be able to match them. On the other hand, if one is UTF-8 and the other is UTF-16, the underlying binary representation won't match, so a byte sequence comparison will fail.
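
To illustrate that point with a standalone sketch (this is not scc code, and 若 is just a stand-in multi-byte token): a byte-level comparison finds a UTF-8 token inside UTF-8 source, but the same text re-encoded as UTF-16 has a different byte layout, so the raw comparison fails.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf16"
)

func main() {
	// Token as it would sit in a UTF-8 encoded languages.json:
	// 若 is three bytes in UTF-8 (e8 8b a5).
	token := []byte("若")

	// UTF-8 source: a plain byte comparison finds the token.
	utf8Source := []byte("若非。")
	fmt.Println(bytes.Contains(utf8Source, token)) // true

	// The same text as UTF-16BE code units has a different byte layout,
	// so the raw byte comparison against the UTF-8 token fails.
	units := utf16.Encode([]rune(string(utf8Source)))
	utf16Source := make([]byte, 0, len(units)*2)
	for _, u := range units {
		utf16Source = append(utf16Source, byte(u>>8), byte(u))
	}
	fmt.Println(bytes.Contains(utf16Source, token)) // false
}
```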

@boyter
Owner Author

boyter commented May 28, 2024

Yep, you are correct. That's the exact reason for it.

Where I am debating is whether it should be fixed in the state machine, OR whether there should be the ability to add new state machines for individual languages. That is something I wanted anyway in order to resolve some annoying issues, and as a potential performance improvement, since common languages such as Java could get their own specific code path.
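
For illustration only, here is a rough sketch of what such a per-language code path might look like; none of these names are scc's actual types, it is just one way a specialised counter could sit beside the generic state machine.

```go
package main

import "fmt"

// Hypothetical sketch only; these are not scc's real types. The generic state
// machine stays as the fallback, and common languages register their own
// specialised counter keyed by language name.
type languageCounter interface {
	CountStats(contents []byte) (code, comments, blanks, complexity int64)
}

// javaCounter stands in for a hand-rolled path that could, for example, skip
// the bookkeeping for nested multiline comments that Java does not allow.
type javaCounter struct{}

func (javaCounter) CountStats(contents []byte) (code, comments, blanks, complexity int64) {
	// ... specialised counting would go here ...
	return 0, 0, 0, 0
}

var counters = map[string]languageCounter{
	"Java": javaCounter{},
}

func main() {
	if c, ok := counters["Java"]; ok {
		fmt.Println(c.CountStats([]byte("class A {}")))
	}
}
```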

@dbaggerman
Collaborator

I remember seeing comments about the idea of language-specific state machines to support the case of having one language nested within another.

The current state machine is built around looking up bytes in a fixed-size 256-element array, which wouldn't adapt well to Unicode runes. The trees are probably not dense enough to justify a binary search within a node, so doing a linear search within each node might be the most obvious solution.
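
To make the linear-search idea concrete, here is a throwaway sketch (not scc's actual Trie code) of a rune-keyed node whose children live in a small slice and are scanned linearly:

```go
package main

import "fmt"

// runeNode is keyed by rune rather than byte; a 0x10FFFF-wide array per node
// is not practical the way a 256-entry byte array is, so children sit in a
// small slice and are searched linearly.
type runeNode struct {
	r        rune
	end      bool
	children []*runeNode
}

func (n *runeNode) child(r rune) *runeNode {
	for _, c := range n.children { // linear search: nodes stay small
		if c.r == r {
			return c
		}
	}
	return nil
}

func (n *runeNode) insert(token string) {
	cur := n
	for _, r := range token {
		next := cur.child(r)
		if next == nil {
			next = &runeNode{r: r}
			cur.children = append(cur.children, next)
		}
		cur = next
	}
	cur.end = true
}

// match reports whether some inserted token matches a prefix of s.
func (n *runeNode) match(s string) bool {
	cur := n
	for _, r := range s {
		if cur = cur.child(r); cur == nil {
			return false
		}
		if cur.end {
			return true
		}
	}
	return false
}

func main() {
	root := &runeNode{}
	root.insert("若非") // multi-rune token inserted rune by rune
	fmt.Println(root.match("若非。")) // true
}
```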

Another solution would be to keep the current Trie implementation, but re-encode the language tokens into a separate Trie per encoding. It would probably require some refactoring to track a Trie or LanguageFeature per encoding, but it might perform better than a linear search.
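
A rough sketch of that re-encoding idea, with made-up helper names: the UTF-8 token strings from languages.json would each be rendered into the byte form of every supported encoding, and those bytes would seed a separate Trie (or LanguageFeature) per encoding, chosen based on the detected file encoding.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// tokenBytesUTF16BE is a hypothetical helper: it turns a UTF-8 token string
// into its UTF-16BE byte representation so the existing byte-oriented Trie
// could be populated once per supported encoding.
func tokenBytesUTF16BE(token string) []byte {
	units := utf16.Encode([]rune(token))
	out := make([]byte, 0, len(units)*2)
	for _, u := range units {
		out = append(out, byte(u>>8), byte(u))
	}
	return out
}

func main() {
	token := "若非"
	fmt.Printf("utf-8 bytes:    % x\n", []byte(token))
	fmt.Printf("utf-16be bytes: % x\n", tokenBytesUTF16BE(token))
	// Each byte form would seed the Trie for its encoding.
}
```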

@boyter
Owner Author

boyter commented May 29, 2024

The idea behind the language-specific state machines was, yes, to support that, as well as to skip a heap of logic that does not apply where possible. For example, Java does not support nested multiline comments, so we can skip the bookkeeping for that and hopefully gain a performance improvement. I would only be looking to do this for the most common languages though, as the generic version is much better for dealing with ad-hoc additions. It would also allow for dealing with any really oddball languages that pop up.

My initial plan is to just see what effect some rune conversion has as a baseline and go from there.
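
For reference, the kind of change that baseline implies, as a standalone sketch rather than scc code: walk the byte slice one decoded rune at a time instead of one byte at a time, so multi-byte tokens line up on rune boundaries.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// Decode a full rune at the current index and advance by its byte width,
	// rather than advancing a single byte per step.
	contents := []byte("若非。x")
	for i := 0; i < len(contents); {
		r, size := utf8.DecodeRune(contents[i:])
		fmt.Printf("index %d: rune %q (%d bytes)\n", i, r, size)
		i += size
	}
}
```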

boyter added this to TODO in Release 3.4.0 Jun 11, 2024