Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Buffer.toString('utf8') appears to use wtf-8 #23280

Closed
AljoschaMeyer opened this issue Oct 5, 2018 · 10 comments
Closed

Buffer.toString('utf8') appears to use wtf-8 #23280

AljoschaMeyer opened this issue Oct 5, 2018 · 10 comments
Labels
buffer Issues and PRs related to the buffer subsystem. doc Issues and PRs related to the documentations. encoding Issues and PRs related to the TextEncoder and TextDecoder APIs. help wanted Issues that need assistance from volunteers or PRs that need help to proceed.

Comments

@AljoschaMeyer
Copy link

AljoschaMeyer commented Oct 5, 2018

The byte sequence 237, 166, 164 is not valid utf8, since it encodes a surrogate code point, which is not a valid unicode scalar value. So Buffer.from([237, 166, 164]).toString('utf8') should error. But instead, it returns a string, effectively implementing wtf-8 rather than utf-8.

Or does Buffer.toString simply not provide any validity guarantees at all, returning garbage strings if the buffer contains invalid input? In that case, please document this as expected behavior, since it makes the function completely useless for a bunch of use cases.

node -v: v10.11.0
uname -a: Linux aljoscha-laptop 4.18.10-arch1-1-ARCH #1 SMP PREEMPT Wed Sep 26 09:48:22 UTC 2018 x86_64 GNU/Linux

See also rust-lang/rust#54845

edit: This also leaks into JSON.parse, which can accept garbage strings even though ECMA-404 (the json standard prescribed for JSON.parse as defined in ECMAScript) only allows valid utf8 input.

@Fishrock123 Fishrock123 added confirmed-bug Issues with confirmed bugs. buffer Issues and PRs related to the buffer subsystem. encoding Issues and PRs related to the TextEncoder and TextDecoder APIs. question Issues that look for answers. and removed question Issues that look for answers. labels Oct 5, 2018
@AljoschaMeyer
Copy link
Author

Actually, straight up Buffer.from([255]).toString('utf8') already produces a string. Which is fine I guess, since js allows strings to contain garbage data, and the Buffer.toString does not specify failure modes (or their absence) at all. So this is not about wtf-8 at all.

This technically might not be a bug, just very surprising behavior that needs better documentation.

@mscdex
Copy link
Contributor

mscdex commented Oct 5, 2018

FWIW even the TextDecoder web API defaults to the same behavior. Also other languages/platforms seem to follow suit as well (e.g. Java 8).

@SimonSapin
Copy link

Buffer.from([255]).toString('utf8').codePointAt(0).toString(16) returns fffd, which is the correct behavior: a U+FFFD RELACEMENT CHARACTER code point is used to represent part of the input byte sequence that was ill-formed in UTF-8. Same with 237, 166, 164.

@mscdex
Copy link
Contributor

mscdex commented Oct 5, 2018

I would suggest using TextDecoder if you want to be able to catch (literally) decoding errors instead of having a string containing replacement characters.

Example:

new TextDecoder('utf-8', { fatal: true }).decode(Buffer.from([237,166,164]))
// throws exception

@AljoschaMeyer
Copy link
Author

Thanks for the suggestion. I guess this mostly closes the issue, but it might make sense to specify that behavior (inserting U+FFFD in case of decoding errors) in the documentation of Buffer.toString.

And while I'm at it: the documentation explicitly links to the constant for the maximum string length, but does not specify what actually happens if the decoded data does not fit into a js string.

@addaleax addaleax removed the confirmed-bug Issues with confirmed bugs. label Oct 5, 2018
@BridgeAR BridgeAR added doc Issues and PRs related to the documentations. good first issue Issues that are suitable for first-time contributors. help wanted Issues that need assistance from volunteers or PRs that need help to proceed. labels Jun 11, 2019
@BridgeAR
Copy link
Member

I am marking this as good first issue to add a proper documentation for the described behavior.

@rexagod
Copy link
Member

rexagod commented Jun 15, 2019

@BridgeAR I'd like to give this a go! Can I proceed?

@BridgeAR
Copy link
Member

@rexagod please go for it!

@rexagod
Copy link
Member

rexagod commented Jun 16, 2019

Modern practice is to replace errors with a replacement character, and to ensure that systems do not interpret these replacement characters in any dangerous way[citation needed]. The errors can be detected later when it is convenient to report an error, or display as blocks when the string is drawn for the user.

@BridgeAR Other than the documentation, would it make sense for node to use a, say, Buffer.toSafeString or a similar method which throws an exception instead of the replacement characters? This would essentially focus on avoiding the usage of replacement characters in (at any index) invalid byte sequences. I should mention that I'm fairly new to the community and searching for ways to contribute. Thanks!

rexagod added a commit to rexagod/node that referenced this issue Jun 16, 2019
added documentation on evaluating legal code points,
and the behavior that stems from it otherwise.

Fixes: nodejs#23280
@Trott Trott removed the good first issue Issues that are suitable for first-time contributors. label Sep 30, 2019
@jasnell
Copy link
Member

jasnell commented Jul 6, 2020

I believe this can be closed. The Buffer docs include the following statement:

If `encoding` is `'utf8'` and a byte sequence in the input is not valid UTF-8,
then each invalid byte is replaced with the replacement character `U+FFFD`.

Which can be verified ...

C:\Users\jasne\Projects\node>node
Welcome to Node.js v13.12.0.
Type ".help" for more information.
> Buffer.from([237, 166, 164]).toString('utf8')
'���'
>

@jasnell jasnell closed this as completed Jul 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
buffer Issues and PRs related to the buffer subsystem. doc Issues and PRs related to the documentations. encoding Issues and PRs related to the TextEncoder and TextDecoder APIs. help wanted Issues that need assistance from volunteers or PRs that need help to proceed.
Projects
None yet
10 participants