Tracking Issue for unicode and escape codes in literals #116907

traviscross · 2023-10-18T22:03:48Z

This is a tracking issue for the RFC 3349 (rust-lang/rfcs#3349).

The feature gate for the issue is #![feature(mixed_utf8_literals)].

From the RFC:

Relax the restrictions on which characters and escape codes are allowed in string, char, byte string, and byte literals.

Most importantly, this means we accept the exact same characters and escape codes in "…" and b"…" literals. That is:

Allow unicode characters, including \u{…} escape codes, in byte string literals. E.g. b"hello\xff我叫\u{1F980}"

Also allow non-ASCII \x… escape codes in regular string literals, as long as they are valid UTF-8. E.g. "\xf0\x9f\xa6\x80"
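As a rough illustration of what the two examples above denote, here is a sketch in stable Rust that assembles the same bytes by hand. The function names are made up for this example, and the proposed literals appear only in comments because they need the unstable feature gate.

```rust
/// Bytes that the proposed literal `b"hello\xff我叫\u{1F980}"` would denote,
/// assembled from pieces that stable Rust already accepts.
/// (`proposed_byte_string` is a hypothetical helper name.)
fn proposed_byte_string() -> Vec<u8> {
    let mut bytes = Vec::new();
    bytes.extend_from_slice(b"hello\xff"); // ASCII plus one raw byte
    bytes.extend_from_slice("我叫".as_bytes()); // two 3-byte UTF-8 sequences
    bytes.extend_from_slice("\u{1F980}".as_bytes()); // one 4-byte UTF-8 sequence
    bytes
}

/// The string that the proposed literal `"\xf0\x9f\xa6\x80"` would denote:
/// those four bytes are the UTF-8 encoding of U+1F980.
/// (`proposed_string` is a hypothetical helper name.)
fn proposed_string() -> &'static str {
    "\u{1F980}"
}
```

The byte string is not valid UTF-8 as a whole because of the lone `\xff`, which is fine for a `&[u8]`; the regular string must be valid UTF-8 in its entirety.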

About tracking issues

Tracking issues are used to record the overall progress of implementation. They are also used as hubs connecting to other relevant issues, e.g., bugs or open design questions. A tracking issue is however not meant for large scale discussion, questions, or bug reports about a feature. Instead, open a dedicated issue for the specific matter and add the relevant feature gate label.

Steps

Implement the RFC
Adjust documentation (see instructions on rustc-dev-guide)
Stabilization PR (see instructions on rustc-dev-guide)

Unresolved Questions

Should concat!("\xf0\x9f", "\xa6\x80") work? (The string literals are not valid UTF-8 individually, but are valid UTF-8 after being concatenated.)
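To make the question concrete, the sketch below checks the same byte sequences with `std::str::from_utf8` on stable Rust. It uses byte strings rather than the proposed string literals, and `utf8_validity` is a hypothetical name chosen for this example.

```rust
use std::str;

/// Returns whether the first half, the second half, and the concatenation
/// of the two are valid UTF-8, in that order.
fn utf8_validity() -> (bool, bool, bool) {
    let first: &[u8] = b"\xf0\x9f"; // start of a 4-byte sequence, truncated
    let second: &[u8] = b"\xa6\x80"; // two continuation bytes with no lead byte
    let joined: Vec<u8> = [first, second].concat();
    (
        str::from_utf8(first).is_ok(),
        str::from_utf8(second).is_ok(),
        str::from_utf8(&joined).is_ok(),
    )
}
```

Each half fails validation on its own, while the joined bytes decode to U+1F980. Whether `concat!` should validate the pieces or only the result is exactly what the question leaves open.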


traviscross · 2023-10-18T22:04:12Z

@rustbot labels +T-lang

traviscross · 2023-10-19T02:16:46Z

@rustbot labels +B-rfc-approved

nnethercote · 2023-12-06T02:23:48Z

I would like to take this one.

nnethercote · 2023-12-13T02:15:58Z

I have a partial implementation of this RFC working locally (EDIT: now at #120286). The RFC proposes five changes to literal syntax. I think three of them are good, and two of them aren't necessary.

`b""`: add unicode chars

Adding them fixes the first of two cases where b"" syntax isn't a superset of "" syntax. This is good, and facilitates "conventionally UTF-8" string literals.

`br""`: add unicode chars

Adding them fixes the one case where br"" syntax isn't a superset of r"" syntax. After this, br"" syntax and r"" syntax are the same. This is good, and also facilitates "conventionally UTF-8" string literals.

`b""`: add `\u{NN}` escapes

Adding them fixes the second of two cases where b"" syntax isn't a superset of "" syntax, and fits well with adding unicode chars. This is good.

Note: After adding this, the one thing b"" syntax has that "" syntax does not is \x80-\xff bytes.

`""`: add `\x80-\xff`

Is this necessary? What useful new functionality does this provide?

It would make "" and b"" syntax identical, but strings and byte strings aren't identical types, so that identicalness isn't needed.

The RFC says "Allowing all characters and all known escape codes in both types of string literals reduces the complexity of the language. We'd no longer have different escape codes for different literal types. We'd only require regular string literals to be valid UTF-8." So it has just traded one exception for another. IMO that's not a simplification.

It's odd that it would be possible to write a "" that isn't valid UTF-8... both conceptually, and in the implementation. For the latter you can no longer start with an empty String and append chars one at a time knowing it'll be valid UTF-8 the whole way, which is how it's currently handled. Instead you need to start with a Vec<u8>, append chars as byte sequences, and then UTF-8 validate at the end. It's not that difficult, but it's not needed for any other literal kind, and it's weird enough that, combined with the other points above, it makes me question the change.

Not doing this keeps "" syntax consistent with '', which makes sense given that "" and '' are both unicode-oriented rather than byte-oriented. This is another refutation of the complexity argument above.

Not doing this was suggested in the "Alternatives" section of the RFC.

Not doing this also renders moot the unresolved question of what to do with concat!("\xf0\x9f", "\xa6\x80").
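The "start with a Vec<u8>, validate at the end" approach described in this section can be sketched roughly as follows. This is not the rustc implementation: `Piece` and `build_str_literal` are hypothetical names, and the sketch assumes the literal has already been split into unescaped chars and raw `\xNN` bytes.

```rust
/// One unescaped unit of a literal (hypothetical type for this sketch).
#[derive(Clone, Copy)]
enum Piece {
    /// A source character or a `\u{...}` escape.
    Char(char),
    /// A raw `\xNN` escape, possibly in the range 0x80-0xff.
    Byte(u8),
}

/// Accumulates bytes and validates UTF-8 only once, at the end.
fn build_str_literal(pieces: &[Piece]) -> Result<String, std::string::FromUtf8Error> {
    let mut buf: Vec<u8> = Vec::new();
    for &piece in pieces {
        match piece {
            Piece::Char(c) => {
                // Encode the char as UTF-8 and append its bytes.
                let mut tmp = [0u8; 4];
                buf.extend_from_slice(c.encode_utf8(&mut tmp).as_bytes());
            }
            // A raw byte may leave `buf` temporarily invalid as UTF-8.
            Piece::Byte(b) => buf.push(b),
        }
    }
    String::from_utf8(buf)
}
```

The intermediate buffer cannot be a `String`, because after pushing `0xf0` alone it is not valid UTF-8; validity is only known once all pieces are in.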

`b''`: add `\u{00}-\u{7f}`

Is this necessary? It doesn't provide any useful new functionality.

The \x syntax is strictly more powerful, covering the range \x00-\xff. And supporting just the ASCII subset of \u escapes doesn't match the behaviour of any of the other literal syntaxes. Byte literals are about a single byte, so why introduce Unicode-related stuff?

The quote from the RFC I mentioned above about complexity applies again, but again, it's just trading one exception for another.
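For comparison, here is a small sketch of the forms that already compile on stable Rust. The proposed `b'\u{41}'` is mentioned only in a comment since it needs the feature gate, and the function names are hypothetical.

```rust
/// The byte for ASCII 'A', written three ways that compile today.
/// The RFC would add a fourth spelling, `b'\u{41}'`, with the same value.
fn ascii_a_forms() -> [u8; 3] {
    [b'A', b'\x41', '\u{41}' as u8]
}

/// The highest byte value, which a byte literal can only spell with `\x`.
fn top_byte() -> u8 {
    b'\xff'
}
```

All three existing spellings yield 0x41, so the `\u` form would add a new spelling for ASCII values rather than a new value.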

cc @rust-lang/lang @m-ou-se

nnethercote · 2023-12-13T02:24:36Z

Here's an alternative version of the table that I've been using and found helpful. It shows all the escapes directly instead of grouping them by name, it shows the changes proposed by the RFC (affected literal kinds have two lines connected by a -->, where the second line shows what changed), and it includes C string literals. The proposed changes I don't like are marked with ?.

        chars    escapes                                        mixed utf8
        -----    -------                                        ----------
- ''    unicode  \' \" \n \r \t \\ \0 \x00-\x7f \u{..}          no
    
- b''   ascii    \' \" \n \r \t \\ \0 \x00-\xff                 no    
  -->                                           \u{0}..\u{7f}?  yes?
    
- ""    unicode  \' \" \n \r \t \\ \0 \x00-\x7f \u{..}          no
  -->                                 \x00-\xff?                yes?

- r""   unicode  N/A                                            no

- b""   ascii    \' \" \n \r \t \\ \0 \x00-\xff                 no
  -->   unicode                                 \u{..}          yes
    
- br""  ascii    N/A                                            no
  -->   unicode
  
- c""   unicode  \' \" \n \r \t \\ __ \x01-\xff \u{..}          yes

- cr""  unicode  N/A                                            no

This makes it easier to see that, for example, adding \x80-\xff to "" syntax would make it identical to b"" syntax, but would also make "" syntax different from '' syntax.
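One way to see why \x80-\xff fits a byte-oriented literal more naturally than a char-oriented one is that the raw byte 0x80 and the scalar value U+0080 are different things once encoded. The sketch below is my own illustration, not something stated in the thread, and `byte_vs_scalar` is a hypothetical name.

```rust
/// Returns the raw byte 0x80 and, separately, the UTF-8 encoding of U+0080.
fn byte_vs_scalar() -> (Vec<u8>, Vec<u8>) {
    let raw_byte = vec![0x80u8];
    let mut tmp = [0u8; 4];
    // `char::from(u8)` maps the byte value to the code point with the same number.
    let scalar = char::from(0x80u8).encode_utf8(&mut tmp).as_bytes().to_vec();
    (raw_byte, scalar)
}
```

The first is one byte and is not valid UTF-8 on its own; the second is the two-byte sequence 0xc2 0x80.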

nnethercote · 2023-12-13T02:31:05Z

BTW, I have implemented the first three changes. They were pretty easy, and piggy-backed naturally off the existing support for mixed utf8 in C string literals, requiring only minor changes.

I haven't implemented the last two. They would both have required new kinds of checks, somewhat annoying to implement, which is what got me thinking about whether they are necessary.

nnethercote · 2024-01-25T03:30:36Z

> BTW, I have implemented the first three changes

A complete draft implementation is now at #120286.


nnethercote · 2024-01-26T21:52:25Z

Nominated for lang-team discussion for this comment above.

joshtriplett · 2024-01-28T19:04:12Z

cc @m-ou-se, who may want to provide input/responses to the above.

joshtriplett · 2024-01-31T16:45:27Z

@nnethercote FWIW, I do feel like having \u{00}-\u{7f} in b'...' is a clear win: if we allow it in b"...", we should also allow it in b'...' as well.

nnethercote · 2024-01-31T21:49:23Z

> @nnethercote FWIW, I do feel like having \u{00}-\u{7f} in b'...' is a clear win: if we allow it in b"...", we should also allow it in b'...' as well.

Is this a consistency argument? Consider the table. Currently some literals don't support \u escapes at all, while some support \u escapes fully. The proposal is to add a third category, \u{00}..\u{7f}, which would only apply to b''. I don't think that's a consistency improvement!

Or maybe it's a Postel's law style "we should accept anything that makes sense" argument? If so, I would immediately ask why? The \xNN form is inherently superior for a literal that defines a single byte, because (a) it's shorter, (b) it covers the full range 0-255, (c) it's naturally byte-oriented and therefore a better conceptual fit than a unicode-oriented escape.

traviscross added the C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. label Oct 18, 2023

rustbot added the T-lang Relevant to the language team, which will review and decide on the PR/issue. label Oct 18, 2023

rustbot added the B-RFC-approved Blocker: Approved by a merged RFC but not yet implemented. label Oct 19, 2023

traviscross mentioned this issue Oct 19, 2023

RFC: Unicode and escape codes in literals rust-lang/rfcs#3349

Merged

traviscross added the F-mixed_utf8_literals #![feature(mixed_utf8_literals)] label Nov 4, 2023

LukasKalbertodt mentioned this issue Nov 10, 2023

Update escape logic according to RFC 3349 LukasKalbertodt/litrs#16

Open

madsmtm mentioned this issue Nov 15, 2023

Tracking Issue for c"…" string literals #105723

Closed

12 tasks

nnethercote self-assigned this Dec 6, 2023

nnethercote mentioned this issue Jan 12, 2024

Delay literal unescaping #118699

Closed

nnethercote mentioned this issue Jan 23, 2024

Implement RFC 3349, mixed utf8 literals #120286

Draft

bors added a commit to rust-lang-ci/rust that referenced this issue Jan 25, 2024

Auto merge of rust-lang#120286 - nnethercote:3349-mixed-utf8-literals…

b626f8d

…, r=<try> Implement RFC 3349, mixed utf8 literals RFC: rust-lang/rfcs#3349 Tracking issue: rust-lang#116907 r? `@ghost`

nnethercote added the I-lang-nominated Nominated for discussion during a lang team meeting. label Jan 26, 2024
