Buffer.toString('utf8') appears to use wtf-8 #23280

AljoschaMeyer · 2018-10-05T14:01:46Z

The byte sequence 237, 166, 164 is not valid utf8, since it encodes a surrogate code point, which is not a valid unicode scalar value. So Buffer.from([237, 166, 164]).toString('utf8') should error. But instead, it returns a string, effectively implementing wtf-8 rather than utf-8.

Or does Buffer.toString simply not provide any validity guarantees at all, returning garbage strings if the buffer contains invalid input? In that case, please document this as expected behavior, since it makes the function completely useless for a bunch of use cases.

node -v: v10.11.0
uname -a: Linux aljoscha-laptop 4.18.10-arch1-1-ARCH #1 SMP PREEMPT Wed Sep 26 09:48:22 UTC 2018 x86_64 GNU/Linux

See also rust-lang/rust#54845

edit: This also leaks into JSON.parse, which can accept garbage strings even though ECMA-404 (the json standard prescribed for JSON.parse as defined in ECMAScript) only allows valid utf8 input.

The text was updated successfully, but these errors were encountered:

AljoschaMeyer · 2018-10-05T15:03:09Z

Actually, straight up Buffer.from([255]).toString('utf8') already produces a string. Which is fine I guess, since js allows strings to contain garbage data, and the Buffer.toString does not specify failure modes (or their absence) at all. So this is not about wtf-8 at all.

This technically might not be a bug, just very surprising behavior that needs better documentation.

mscdex · 2018-10-05T15:05:40Z

FWIW even the TextDecoder web API defaults to the same behavior. Also other languages/platforms seem to follow suit as well (e.g. Java 8).

SimonSapin · 2018-10-05T15:20:31Z

Buffer.from([255]).toString('utf8').codePointAt(0).toString(16) returns fffd, which is the correct behavior: a U+FFFD RELACEMENT CHARACTER code point is used to represent part of the input byte sequence that was ill-formed in UTF-8. Same with 237, 166, 164.

mscdex · 2018-10-05T15:27:55Z

I would suggest using TextDecoder if you want to be able to catch (literally) decoding errors instead of having a string containing replacement characters.

Example:

new TextDecoder('utf-8', { fatal: true }).decode(Buffer.from([237,166,164]))
// throws exception

AljoschaMeyer · 2018-10-05T15:32:36Z

Thanks for the suggestion. I guess this mostly closes the issue, but it might make sense to specify that behavior (inserting U+FFFD in case of decoding errors) in the documentation of Buffer.toString.

And while I'm at it: the documentation explicitly links to the constant for the maximum string length, but does not specify what actually happens if the decoded data does not fit into a js string.

BridgeAR · 2019-06-11T22:57:03Z

I am marking this as good first issue to add a proper documentation for the described behavior.

rexagod · 2019-06-15T05:47:24Z

@BridgeAR I'd like to give this a go! Can I proceed?

BridgeAR · 2019-06-15T08:49:46Z

@rexagod please go for it!

rexagod · 2019-06-16T02:36:42Z

Modern practice is to replace errors with a replacement character, and to ensure that systems do not interpret these replacement characters in any dangerous way[citation needed]. The errors can be detected later when it is convenient to report an error, or display as blocks when the string is drawn for the user.

@BridgeAR Other than the documentation, would it make sense for node to use a, say, Buffer.toSafeString or a similar method which throws an exception instead of the replacement characters? This would essentially focus on avoiding the usage of replacement characters in (at any index) invalid byte sequences. I should mention that I'm fairly new to the community and searching for ways to contribute. Thanks!

added documentation on evaluating legal code points, and the behavior that stems from it otherwise. Fixes: nodejs#23280

jasnell · 2020-07-06T19:13:51Z

I believe this can be closed. The Buffer docs include the following statement:

If `encoding` is `'utf8'` and a byte sequence in the input is not valid UTF-8,
then each invalid byte is replaced with the replacement character `U+FFFD`.

Which can be verified ...

C:\Users\jasne\Projects\node>node
Welcome to Node.js v13.12.0.
Type ".help" for more information.
> Buffer.from([237, 166, 164]).toString('utf8')
'���'
>

Fishrock123 added confirmed-bug Issues with confirmed bugs. buffer Issues and PRs related to the buffer subsystem. encoding Issues and PRs related to the TextEncoder and TextDecoder APIs. question Issues that look for answers. and removed question Issues that look for answers. labels Oct 5, 2018

addaleax removed the confirmed-bug Issues with confirmed bugs. label Oct 5, 2018

BridgeAR added doc Issues and PRs related to the documentations. good first issue Issues that are suitable for first-time contributors. help wanted Issues that need assistance from volunteers or PRs that need help to proceed. labels Jun 11, 2019

rexagod added a commit to rexagod/node that referenced this issue Jun 16, 2019

doc: add documentation for invalid byte sequences

aa01448

added documentation on evaluating legal code points, and the behavior that stems from it otherwise. Fixes: nodejs#23280

rexagod mentioned this issue Jun 16, 2019

doc: add documentation for invalid byte sequences #28249

Closed

2 tasks

Trott removed the good first issue Issues that are suitable for first-time contributors. label Sep 30, 2019

trivikr added the Hacktoberfest label Oct 1, 2019

Trott removed the Hacktoberfest label Nov 3, 2019

jgehrcke mentioned this issue Nov 28, 2019

docs: Buffer.toString(): add note about invalid data #30706

Closed

2 tasks

jasnell closed this as completed Jul 6, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Buffer.toString('utf8') appears to use wtf-8 #23280

Buffer.toString('utf8') appears to use wtf-8 #23280

AljoschaMeyer commented Oct 5, 2018 •

edited

Loading

AljoschaMeyer commented Oct 5, 2018

mscdex commented Oct 5, 2018

SimonSapin commented Oct 5, 2018

mscdex commented Oct 5, 2018 •

edited

Loading

AljoschaMeyer commented Oct 5, 2018

BridgeAR commented Jun 11, 2019

rexagod commented Jun 15, 2019

BridgeAR commented Jun 15, 2019

rexagod commented Jun 16, 2019 •

edited

Loading

jasnell commented Jul 6, 2020

Buffer.toString('utf8') appears to use wtf-8 #23280

Buffer.toString('utf8') appears to use wtf-8 #23280

Comments

AljoschaMeyer commented Oct 5, 2018 • edited Loading

AljoschaMeyer commented Oct 5, 2018

mscdex commented Oct 5, 2018

SimonSapin commented Oct 5, 2018

mscdex commented Oct 5, 2018 • edited Loading

AljoschaMeyer commented Oct 5, 2018

BridgeAR commented Jun 11, 2019

rexagod commented Jun 15, 2019

BridgeAR commented Jun 15, 2019

rexagod commented Jun 16, 2019 • edited Loading

jasnell commented Jul 6, 2020

AljoschaMeyer commented Oct 5, 2018 •

edited

Loading

mscdex commented Oct 5, 2018 •

edited

Loading

rexagod commented Jun 16, 2019 •

edited

Loading