Skip to content

mnemnion/runeset

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Runeset: Sets of UTF-8 Characters

This library offers a compact data structure for "generalized"1 UTF-8 encoded codepoints. The design is based on an implicit data structure2, which uses @popCount and bit masking to check membership quickly, with minimal branching, and without having to decode the UTF-8 into another format (for instance, a codepoint).

This design is original, in the sense that I invented it. There may be prior art, it's remarkably difficult to search for "UTF-8 character sets" and find papers on set data structures, so I can't say with high confidence that it's truly novel; in a sense, it's an obvious extension of the widespread practice of using a pair of u64 bitmasks to detect a set of ASCII values. What I can say is: it's effective.

The RuneSet struct is just a slice of u64, offering a few variations on set membership tests, depending on how confident you are that the string it's testing is valid UTF-8, and what you would like the test to do in the event that it isn't. Also supported are the fundamental set operations: equality, subset, union, intersection, and difference. All of the three combining forms produce a new RuneSet, and must be given an Allocator.

The RuneSet is immutable by design, and as such, doesn't support adding or removing codepoints to an already-created set. This would be possible to implement in principle, but I don't happen to have a use for mutable adds and removes, and therefore didn't write them. You can create a RuneSet containing a single character, and via union or difference, either add or subtract that character, if you would like: this would create a new RuneSet with just that character added or removed.

The data structure is very fast. The test suite, with extensions, contains 5MB of sample data, with a 20MB extra collection, and performs a multitude of operations on that data, on the order of a dozen per string. In ReleaseFast mode, the main suite of 5MB concludes in roughly a second on an M1. With the annex, including fuzzing the creation function 2^24 times, a complete run is over in six seconds, give or take a few milli. About 2/3rds of that is the fuzzing.

While I have yet to set up definitive benchmarks, it's clear that membership testing using RuneSet has a throughput over 100 MB/sec on modern machines, potentially substantially higher than this. I'll update this with better numbers when I get around to benchmarking it; the focus has been on implementing the library, and assuring good test coverage.

For real character sets, it is also quite sparing of memory. The CJK Ideographs block of Unicode, consisting of 22,992 codepoints, is represented as a RuneSet in 338 words. Dense ranges such as that block can of course be tested by decoding the tested sequence into a codepoint and checking if it's in that range, but the combination of generality, compactness, and speed, offered by the RuneSet, is unique.

Speaking of test coverage, the library has 100% line coverage, which you may verify by running zig build cov, provided you have kcov installed. To run the extended suite, including the fuzzer, use zig build -Dtest-more test. This will take an appreciable amount of time in the default debug-mode builds, as it will trigger the 244 assertions which gird the library.

Roadmap

I intend to add some ability to write out the data structure, whether as binary data, Zig source code, or potentially both. As an allocating data structure, it isn't currently possible to generate a RuneSet at comptime, and many applications of them will use character sets known in advance, which may as well be built into the .rodata of programs, rather than constructed at runtime from strings.

I also plan to wrap all of the major functionality in C-compatible data structures and functions, for use wherever the C ABI is spoken.

Other than that, I consider the library fully implemented and feature-complete. Odds are good, however, that more open-source code, which makes use of the RuneSet as a foundation, will be forthcoming.

Footnotes

  1. Generalized, here, has a similar meaning to its use in WTF-8, except that RuneSet can also encode so-called "overlong" encodings, as well as surrogates (paired vs. unpaired is a meaningless distinction for a codepoint set). This falls out of the design: it's capable of encoding those sequences, and I saw no reason to impose a performance burden in order to prevent it. On the contrary, should you want to represent the set of all overlong encodings, perhaps to detect them and raise an error, a RuneSet is a fine way to do it. A set which does not contain overlong encodings or surrogates will never match against them, so this seeming laxity comes with no disadvantages. The five- and six-byte sequences from Pike and Thompson's original FSS UTF-8 of 1993 are not supported. As such, what would constitute lead bytes for those sequences are rejected as invalid. It would be possible to extend the RuneSet data structure to include these, if there were ever any point in so doing.

  2. That the RuneSet is 'implicit' doesn't make it the smallest possible encoding for every subset of UTF-8. For example, the CJK Unified Ideographs block could be represented as two numbers, and for some crafted inputs, such as one in 64 characters throughout the Unicode range, a bitmask of one bit per character would be smaller as well. It means that the structure is implicit to the data, without indirection, and that in full generality it isn't possible to represent UTF-8 sequences more compactly, compared to the information theoretic lower-bound. Our overhead is limited to one word, used to store one of the offsets as an important optimization.