Escape combining characters in char::Debug #49283

varkor · 2018-03-22T18:17:01Z

Although combining characters are technically printable, they make little sense to print on their own with Debug: it'd be better to escape them like non-printable characters.

This is a breaking change, but I imagine the fact escape_debug is rare and almost certainly primarily used for debugging that this is an acceptable change.
Resolves #41922.

r? @alexcrichton
cc @clarcharr

clarfonthey · 2018-03-22T23:42:53Z

LGTM although the character ranges should probably be generated with a script instead of hard-coded, as they can change across unicode versions.

varkor · 2018-03-23T00:04:51Z

@clarcharr: have you got any ideas for how to detect ranges programmatically? Apart from parsing the Unicode symbol names (which does not seem robust), I couldn't find any reasonable suggestion for how to do this automatically.

alexcrichton · 2018-03-23T19:24:03Z

r? @SimonSapin

SimonSapin · 2018-03-23T20:09:06Z

This PR is modifying the char::escape_debug. When calling this method directly or through impl Debug for char, there is only one code point being printed so the idea of escaping code points that do not make much sense on their own seems fine.

However this method is also used in impl Debug for str and in str::escape_debug. #41922 itself does not expect the behavior to change there and I agree. However that means making them based on something other than char::escape_debug. Should that other thing be exposed in new public API?

Regarding this being a breaking change, I think I’d be ok with documenting that the exact set of code points being escaped or not is not stable outside of the ASCII range.

The current doc-comment is already out of sync with the implementation: it claims that anything non-ASCII is escaped, which hasn’t been the case since #34485.

I also agree with not hard-coding code point ranges, and instead extracting more Unicode data in src/libstd_unicode/unicode.py. However it’s unclear to me what exactly should be included. Canonical_Combining_Class is about reordering during normalization and probably not what we want. The Diacritic property is defined as:

Characters that linguistically modify the meaning of another character to which they apply. Some diacritics are not combining characters, and some combining characters are not diacritics.

Then there’s the *Mark properties:

Abbr Long Description

Mn Nonspacing_Mark a nonspacing combining mark (zero advance width)

Mc Spacing_Mark a spacing combining mark (positive advance width)

Me Enclosing_Mark an enclosing combining mark

M Mark Mn | Mc | Me

However #41922 suggests that only nonspacing marks are problematic?

clarfonthey · 2018-03-23T21:12:03Z

I'd be fine with just going for nonspacing marks.

As for str::escape_debug goes, I'd personally opt for escaping any combination of nonspacing marks at the beginning of the string, although it's unclear if that would actually solve the problem. I'm fine with just leaving it with the current behaviour for now.

varkor · 2018-03-26T21:55:50Z

However that means making them based on something other than char::escape_debug. Should that other thing be exposed in new public API?

Yeah, I'm not sure about this. It'd be nicer if there was an extra argument (single: bool), so the behaviour could be different for individual characters, but we obviously can't change the signature. A specific public extra method that just special-cases non-spacing combining characters seems a little specialised, though.

The current doc-comment is already out of sync with the implementation: it claims that anything non-ASCII is escaped, which hasn’t been the case since #34485.

Which doc comment are you referring to here?

I also agree with not hard-coding code point ranges, and instead extracting more Unicode data in src/libstd_unicode/unicode.py

I have a change that uses Mn instead of hard-coding ranges, pending the decision about the method / effect on str impls.

varkor · 2018-04-04T21:33:05Z

@SimonSapin: as far as I can tell, there's an issue using src/libstd_unicode/unicode.py here, as it's not available for core, which is where the formatting trait is defined. Is it worth moving it, or taking a different approach?

clarfonthey · 2018-04-05T02:43:52Z

Wasn't the goal of having libstd_unicode separate from libcore to avoid having to include Unicode tables in libcore? Which makes it suddenly interesting considering how the changes to escape_debug make that less necessary. Or, I'm probably missing something.

Personally I really dislike the libstd_unicode façade, as it tends to get way less attention than libcore, liballoc, and libstd, which leads to bugs like #44303. I'm still confused why it exists and I don't think I've actually seen it explained somewhere.

SimonSapin · 2018-04-05T17:33:41Z

#49698 merges std_unicode into core.

It might be interesting to replace printable.py (formerly char_private.py) and unicode.py with something based on https://github.com/BurntSushi/ucd-generate.

bors · 2018-04-12T03:01:16Z

☔ The latest upstream changes (presumably #49698) made this pull request unmergeable. Please resolve the merge conflicts.

varkor · 2018-04-12T11:26:39Z

@SimonSapin: #49698 makes everything much more pleasant :)

I've rebased on top of it (and also cleaned up unicode.py whilst I was at it, because it had some slightly annoying behaviour before).

SimonSapin · 2018-04-12T13:29:31Z

@varkor Thanks for your work on this. However I’m still not sure what the right thing to do here is, in terms of what characters exactly to escape or not in which context. Do you know what other languages do for debug-printing Unicode strings and characters?

varkor · 2018-04-12T13:35:18Z

@SimonSapin: Swift does something similar to the proposed change here. I haven't tried working out exactly which characters they choose to escape, but it seems reasonable to assume they choose either the same, or a similar category.

varkor · 2018-04-12T20:36:26Z

Ah, so Swift actually escapes any non-ASCII character, which is what Rust used to do before #24588:
https://github.com/apple/swift/blob/c35d508600516b892732e2fd3f0f0a17ca4562ba/stdlib/public/core/UnicodeScalar.swift#L197-L252
So this is new territory, and it makes sense to go with what we think is most sensible: in this case, I think the suggestion of Mn is a reasonable choice.

TimNN · 2018-04-17T16:03:15Z

@SimonSapin: Friendly triage ping :)

…private

SimonSapin · 2018-05-21T18:21:46Z

@bors r+

bors · 2018-05-21T18:21:47Z

📌 Commit b653937 has been approved by SimonSapin

bors · 2018-05-21T23:26:38Z

⌛ Testing commit b653937 with merge 65a16c0...

@alexcrichton

Escape combining characters in char::Debug Although combining characters are technically printable, they make little sense to print on their own with `Debug`: it'd be better to escape them like non-printable characters. This is a breaking change, but I imagine the fact `escape_debug` is rare and almost certainly primarily used for debugging that this is an acceptable change. Resolves #41922. r? @alexcrichton cc @clarcharr

bors · 2018-05-22T01:53:46Z

☀️ Test successful - status-appveyor, status-travis
Approved by: SimonSapin
Pushing 65a16c0 to master...

rust-highfive assigned alexcrichton Mar 22, 2018

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Mar 22, 2018

kennytm added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Mar 23, 2018

rust-highfive assigned SimonSapin and unassigned alexcrichton Mar 23, 2018

clarfonthey mentioned this pull request Apr 6, 2018

Merge the std_unicode crate into the core crate #49698

Merged

varkor force-pushed the combining-chars-escape_debug branch from fb39348 to a1c6deb Compare April 12, 2018 11:25

varkor force-pushed the combining-chars-escape_debug branch 2 times, most recently from 672b39f to 59513ad Compare April 12, 2018 23:05

rust-lang deleted a comment from rust-highfive Apr 12, 2018

varkor force-pushed the combining-chars-escape_debug branch from 59513ad to c85cc88 Compare April 13, 2018 17:37

rust-lang deleted a comment from rust-highfive Apr 13, 2018

varkor force-pushed the combining-chars-escape_debug branch from c85cc88 to e96a115 Compare April 13, 2018 20:26

rust-lang deleted a comment from rust-highfive Apr 13, 2018

rust-lang deleted a comment from rust-highfive Apr 17, 2018

SimonSapin added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 21, 2018

varkor added 12 commits May 21, 2018 18:57

Update .gitignore for libstd_unicode

9802300

Update unicode/tables.rs with Mn

f53022f

Download unicode data files in directory of unicode.py

a0b5d38

Keep tables.rs copyright notice up to date

b72faf5

Escape combining characters in escape_debug

4694d20

Add test for Debug formatting of char

699a2b5

Remove example in test for is_nonspacing_mark because it's currently …

68c4fb8

…private

Use the correct output directory for downloading Unicode files

d3c257b

Use Grapheme_Extend instead of Mn

d7aa35e

Make {char, str}::escape_debug and impl Debug for {char, str} consistent

8c89e7f

Only escape extended grapheme characters in the first position

c51f002

Avoid counting characters and add explanatory comment to test

2fa22ef

varkor force-pushed the combining-chars-escape_debug branch from 3591ecd to 2fa22ef Compare May 21, 2018 18:00

Fix tables.rs

b653937

SimonSapin approved these changes May 21, 2018

View reviewed changes

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels May 21, 2018

bors merged commit b653937 into rust-lang:master May 22, 2018

bors mentioned this pull request May 22, 2018

unicode.py refactor part 1 #50922

Closed

varkor deleted the combining-chars-escape_debug branch May 22, 2018 09:21

Mark-Simulacrum mentioned this pull request Jun 27, 2018

Test suites of xml-rs and quick-xml crates fails on 1.28 beta #51843

Closed

max-sixty mentioned this pull request Jul 24, 2019

Confusing Debug of chars #62947

Closed

Mark-Simulacrum mentioned this pull request Mar 27, 2020

Shrink Unicode tables (even more) #70486

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Escape combining characters in char::Debug #49283

Escape combining characters in char::Debug #49283

varkor commented Mar 22, 2018

clarfonthey commented Mar 22, 2018

varkor commented Mar 23, 2018

alexcrichton commented Mar 23, 2018

SimonSapin commented Mar 23, 2018

clarfonthey commented Mar 23, 2018

varkor commented Mar 26, 2018

varkor commented Apr 4, 2018

clarfonthey commented Apr 5, 2018

SimonSapin commented Apr 5, 2018

bors commented Apr 12, 2018

varkor commented Apr 12, 2018

SimonSapin commented Apr 12, 2018

varkor commented Apr 12, 2018

varkor commented Apr 12, 2018

TimNN commented Apr 17, 2018

SimonSapin commented May 21, 2018

bors commented May 21, 2018

bors commented May 21, 2018

bors commented May 22, 2018

Escape combining characters in char::Debug #49283

Escape combining characters in char::Debug #49283

Conversation

varkor commented Mar 22, 2018

clarfonthey commented Mar 22, 2018

varkor commented Mar 23, 2018

alexcrichton commented Mar 23, 2018

SimonSapin commented Mar 23, 2018

clarfonthey commented Mar 23, 2018

varkor commented Mar 26, 2018

varkor commented Apr 4, 2018

clarfonthey commented Apr 5, 2018

SimonSapin commented Apr 5, 2018

bors commented Apr 12, 2018

varkor commented Apr 12, 2018

SimonSapin commented Apr 12, 2018

varkor commented Apr 12, 2018

varkor commented Apr 12, 2018

TimNN commented Apr 17, 2018

SimonSapin commented May 21, 2018

bors commented May 21, 2018

bors commented May 21, 2018

bors commented May 22, 2018