From aa01448ab8b88dd45ce9221eef9d65bffd4f59ca Mon Sep 17 00:00:00 2001
From: rexagod
Date: Sun, 16 Jun 2019 12:06:47 +0530
Subject: [PATCH 1/6] doc: add documentation for invalid byte sequences

added documentation on evaluating legal code points, and the behavior that stems from it otherwise.

Fixes: https://github.com/nodejs/node/issues/23280
---
 doc/api/buffer.md | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/doc/api/buffer.md b/doc/api/buffer.md
index dfb18eeb5615c2..23510d6cfff443 100644
--- a/doc/api/buffer.md
+++ b/doc/api/buffer.md
@@ -165,6 +165,38 @@ console.log(Buffer.from('fhqwhgads', 'utf16le'));
 // Prints:
 ```

+### Evaluating legal code points for '`utf-8'` encoding
+
+Byte sequences that do not have corresponding UTF-16 encodings and non-legal
+Unicode values, along with their UTF-8 counterparts must be treated as
+invalid byte sequences.
+
+For cases regarding operations other than employing backward compatibility
+for 7-bit (and [extended 8-bit]((https://en.wikipedia.org/wiki/UTF-8#Description))
+in rare cases) `'ascii'` data, and the valid [`UTF-8` code units](https://en.wikipedia.org/wiki/UTF-8#Codepage_layout),
+it should be noted that the replacement character (`�`) is returned,
+and *no exception will be thrown*.
+
+It should also be noted that a `U+FFFD` replacement value
+(representing the aforementioned replacement character) will be returned
+in case of decoding errors (invalid unicode scalar values).
+
+```js
+// assume an invalid byte sequence
+const buf = Buffer.from([237, 166, 164]);
+
+const buf_str = buf.toString('utf-8');
+
+console.log(buf_str);
+// Prints: '�'
+
+console.log(buf.byteLength(buf_str));
+// Prints: 3
+
+console.log(buf.codePointAt(0).toString(16));
+// Prints: 'fffd'
+```
+
 The character encodings currently supported by Node.js include:

 * `'ascii'` - For 7-bit ASCII data only.
This encoding is fast and will strip

From a5c99156db82c6ccfa3355498912e47f093d4828 Mon Sep 17 00:00:00 2001
From: Pranshu Srivastava
Date: Sun, 16 Jun 2019 12:56:05 +0530
Subject: [PATCH 2/6] Update buffer.md

---
 doc/api/buffer.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/api/buffer.md b/doc/api/buffer.md
index 23510d6cfff443..21964d92ba9f66 100644
--- a/doc/api/buffer.md
+++ b/doc/api/buffer.md
@@ -182,7 +182,7 @@ It should also be noted that a `U+FFFD` replacement value
 in case of decoding errors (invalid unicode scalar values).

 ```js
-// assume an invalid byte sequence
+// Assuming an invalid byte sequence
 const buf = Buffer.from([237, 166, 164]);

 const buf_str = buf.toString('utf-8');

From 0b2d59101e119a398b56438de763f55454ecfa72 Mon Sep 17 00:00:00 2001
From: Pranshu Srivastava
Date: Sun, 16 Jun 2019 19:03:48 +0530
Subject: [PATCH 3/6] re-position doc segment

---
 doc/api/buffer.md | 64 +++++++++++++++++++++++------------------------
 1 file changed, 32 insertions(+), 32 deletions(-)

diff --git a/doc/api/buffer.md b/doc/api/buffer.md
index 21964d92ba9f66..2b62097e7daa74 100644
--- a/doc/api/buffer.md
+++ b/doc/api/buffer.md
@@ -165,38 +165,6 @@ console.log(Buffer.from('fhqwhgads', 'utf16le'));
 // Prints:
 ```

-### Evaluating legal code points for '`utf-8'` encoding
-
-Byte sequences that do not have corresponding UTF-16 encodings and non-legal
-Unicode values, along with their UTF-8 counterparts must be treated as
-invalid byte sequences.
-
-For cases regarding operations other than employing backward compatibility
-for 7-bit (and [extended 8-bit]((https://en.wikipedia.org/wiki/UTF-8#Description))
-in rare cases) `'ascii'` data, and the valid [`UTF-8` code units](https://en.wikipedia.org/wiki/UTF-8#Codepage_layout),
-it should be noted that the replacement character (`�`) is returned,
-and *no exception will be thrown*.
-
-It should also be noted that a `U+FFFD` replacement value
-(representing the aforementioned replacement character) will be returned
-in case of decoding errors (invalid unicode scalar values).
-
-```js
-// Assuming an invalid byte sequence
-const buf = Buffer.from([237, 166, 164]);
-
-const buf_str = buf.toString('utf-8');
-
-console.log(buf_str);
-// Prints: '�'
-
-console.log(buf.byteLength(buf_str));
-// Prints: 3
-
-console.log(buf.codePointAt(0).toString(16));
-// Prints: 'fffd'
-```
-
 The character encodings currently supported by Node.js include:

 * `'ascii'` - For 7-bit ASCII data only. This encoding is fast and will strip
@@ -229,6 +197,38 @@ the WHATWG specification it is possible that the server actually returned
 `'win-1252'`-encoded data, and using `'latin1'` encoding may incorrectly decode
 the characters.

+### Evaluating legal code points for '`utf-8'` encoding
+
+Byte sequences that do not have corresponding UTF-16 encodings and non-legal
+Unicode values, along with their UTF-8 counterparts must be treated as
+invalid byte sequences.
+
+For cases regarding operations other than employing backward compatibility
+for 7-bit (and [extended 8-bit]((https://en.wikipedia.org/wiki/UTF-8#Description))
+in rare cases) `'ascii'` data, and the valid [`UTF-8` code units](https://en.wikipedia.org/wiki/UTF-8#Codepage_layout),
+it should be noted that the replacement character (`�`) is returned,
+and *no exception will be thrown*.
+
+It should also be noted that a `U+FFFD` replacement value
+(representing the aforementioned replacement character) will be returned
+in case of decoding errors (invalid unicode scalar values).
+
+```js
+// Assuming an invalid byte sequence
+const buf = Buffer.from([237, 166, 164]);
+
+const buf_str = buf.toString('utf-8');
+
+console.log(buf_str);
+// Prints: '�'
+
+console.log(buf.byteLength(buf_str));
+// Prints: 3
+
+console.log(buf.codePointAt(0).toString(16));
+// Prints: 'fffd'
+```
+
 ## Buffers and TypedArray