Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DistString impl to Uniform and Slice #1315

Merged
merged 2 commits into from
Jul 14, 2023

Conversation

aobatact
Copy link
Contributor

Uniform<char> and Slice<char> can be useful with DistString to generate a String with range or set or chars.

Question

  • Currently impl DistString for Slice<'a, char> checks the max_char_len to minimize the reserve length, but do we need this?

Copy link
Member

@dhardy dhardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR

Comment on lines 126 to 136
let max_char_len = self
.slice
.iter()
.try_fold(1, |max_len, char| {
// When the current max_len is 4, the result max_char_len will be 4.
Some(max_len.max(char.len_utf8())).filter(|len| *len < 4)
})
.unwrap_or(4);

string.reserve(max_char_len * len);
string.extend(self.sample_iter(rng).take(len))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the slice is large, this could take significant time (and the only purpose is to potentially reduce the max-char-len used for space reservation below 4). I suggest only iterating when the length is under some bound (1000 maybe?).

Further, a slice could contain one 4-byte character but mostly 1-byte chars so this could massively over-reserve (perhaps significant if len is large). It isn't obvious what the best approach is (or how to test — it could be very context dependent — perhaps try an approach which isn't terrible anywhere rather than to perfectly optimise).

I mean it's arguable whether just reserving len bytes would be better. Or you could sample len/4, check the length and repeat... but probably over-complex.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the goal here? How many allocations are acceptable? We could always have conservative lower limit and let extend deal with the additional allocations. If we reserve the minimal len/4, this should result in at most 3 allocations, right? I think this is good enough for a "general-purpose" implementation.

Are there use cases where the slice is large? Typically, I would expect the "alphabet" to be small, unless it is abused for weighted sampling, for which we have better implementations. Another use case could be "valid Unicode except a few characters", but for this, rejection sampling should be used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically, I would expect the "alphabet" to be small

As would I, but it could be large and this is easy to test. The simplest solution would be to measure max-char-size only where the alphabet is reasonably small, e.g. under 200 items.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As pointed out, I implemented not checking for long slices. The limit is currently 200 but I don't know what is the best size for this.
I also split the sampling if the sampling length is long or the slice contains not only ascii.

// Get the utf8 length of hi to minimize extra space.
// SAFETY: hi used to be valid char.
// This relies on range constructors which accept char arguments.
let max_char_len = unsafe { char::from_u32_unchecked(hi).len_utf8() };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe char::len_utf8 even cares whether hi is a valid char so this should be safe, but the use of unsafe still feels unnecessary. Suggestion:

let max_char_len = char::from_u32(hi).map(char::len_utf8).unwrap_or(4);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have applied your suggestion.

Comment on lines 852 to 856
// Getting the hi value to assume the required length to reserve in string.
let mut hi = self.0.sampler.low + self.0.sampler.range;
if hi >= CHAR_SURROGATE_START {
hi += CHAR_SURROGATE_LEN;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Largest possible result is one less than this I think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have some tests for such corner cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have fix this and add some test.

@aobatact
Copy link
Contributor Author

Thank you for review!

@vks vks added the D-review Do: needs review label Jul 4, 2023
Copy link
Member

@dhardy dhardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@dhardy dhardy merged commit ee80b41 into rust-random:master Jul 14, 2023
@aobatact aobatact deleted the more-dist-string branch July 19, 2023 08:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
D-review Do: needs review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants