Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

src: added big-endian check and support to UTF-16 encode #3410

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 6 additions & 3 deletions src/node_buffer.cc
Original file line number Diff line number Diff line change
Expand Up @@ -460,10 +460,11 @@ void StringSlice<UCS2>(const FunctionCallbackInfo<Value>& args) {
// need to reorder on BE platforms. See http://nodejs.org/api/buffer.html
// regarding Node's "ucs2" encoding specification.
const bool aligned = (reinterpret_cast<uintptr_t>(data) % sizeof(*buf) == 0);
if (IsLittleEndian() && aligned) {
buf = reinterpret_cast<const uint16_t*>(data);
} else {
if (IsLittleEndian() && !aligned) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change? It looks like a semantic null change and it makes the diff noisier than it needs to be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bnoordhuis
In this file, there are four cases that must be considered:

  1. Little endian machine, properly aligned buf
  2. Little endian machine, misaligned buf
  3. Big endian machine, properly aligned buf
  4. Big endian machine, misaligned buf

At present, reinterpreting "buf" as data is only done for the first case, and all other cases (2, 3 and 4) will involve a copy being made to produce an aligned version of unaligned data. This operation has the side effect of doing endian swapping ONLY when run on a big-endian system. This means that on a big-endian system, we will have data stored for the wrong endianness (little endian data gets swapped by alignment operation to big endian, then encode runs its own swapping function which swaps it to little endian again).

The proposed change will apply the alignment correction operation only for case 2 (little endian, misaligned). Case 1 (little endian, aligned) needs neither endianness swapping or alignment correction, while cases 3 and 4 will be handled using StringBytes::Encode's swap operation (which also takes care of alignment issues).

ALL cases need that buf is reinterpreted as a uint16_t variable, so we reinterpret either way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, that makes sense. Thanks.

// Make a copy to avoid unaligned accesses in v8::String::NewFromTwoByte().
// This applies ONLY to little endian platforms, as misalignment will be
// handled by a byte-swapping operation in StringBytes::Encode on
// big endian platforms.
uint16_t* copy = new uint16_t[length];
for (size_t i = 0, k = 0; i < length; i += 1, k += 2) {
// Assumes that the input is little endian.
Expand All @@ -473,6 +474,8 @@ void StringSlice<UCS2>(const FunctionCallbackInfo<Value>& args) {
}
buf = copy;
release = true;
} else {
buf = reinterpret_cast<const uint16_t*>(data);
}

args.GetReturnValue().Set(StringBytes::Encode(env->isolate(), buf, length));
Expand Down
16 changes: 12 additions & 4 deletions src/string_bytes.cc
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@

#include <limits.h>
#include <string.h> // memcpy
#include <vector>

// When creating strings >= this length v8's gc spins up and consumes
// most of the execution time. For these cases it's more performant to
Expand Down Expand Up @@ -406,9 +407,7 @@ size_t StringBytes::Write(Isolate* isolate,
reinterpret_cast<uintptr_t>(buf) % sizeof(uint16_t);
if (is_aligned) {
uint16_t* const dst = reinterpret_cast<uint16_t*>(buf);
for (size_t i = 0; i < nchars; i++)
dst[i] = dst[i] << 8 | dst[i] >> 8;
break;
SwapBytes(dst, dst, nchars);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was pointed out in #7645 that this change introduces a bug: the break statement should have been retained because now the bytes are swapped back by the fallback path below.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing out! Does anyone already develop a fix? if not, I can do the favour.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#7645 should fix this when it lands but I've made a note in case that PR stalls.

}

ASSERT_EQ(sizeof(uint16_t), 2);
Expand Down Expand Up @@ -857,7 +856,16 @@ Local<Value> StringBytes::Encode(Isolate* isolate,
const uint16_t* buf,
size_t buflen) {
Local<String> val;

std::vector<uint16_t> dst;
if (IsBigEndian()) {
// Node's "ucs2" encoding expects LE character data inside a
// Buffer, so we need to reorder on BE platforms. See
// http://nodejs.org/api/buffer.html regarding Node's "ucs2"
// encoding specification
dst.resize(buflen);
SwapBytes(&dst[0], buf, buflen);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, since we know buflen before allocating, and I don't see any vector specific operations besides .resize(), is using std::vector necessary?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's necessary. Simple "new uint16_t[buflen]" is sufficient, and using a vector would just increase overhead. Do you think so? @bnoordhuis

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use of std::vector was suggested to me by @bnoordhuis to avoid explicitly freeing memory. See #3410 (diff).

buf = &dst[0];
}
if (buflen < EXTERN_APEX) {
val = String::NewFromTwoByte(isolate,
buf,
Expand Down
14 changes: 14 additions & 0 deletions src/util-inl.h
Original file line number Diff line number Diff line change
Expand Up @@ -198,6 +198,20 @@ TypeName* Unwrap(v8::Local<v8::Object> object) {
return static_cast<TypeName*>(pointer);
}

void SwapBytes(uint16_t* dst, const uint16_t* src, size_t buflen) {
for (size_t i = 0; i < buflen; i++) {
// __builtin_bswap16 generates more efficient code with
// g++ 4.8 on PowerPC and other big-endian archs
#ifdef __GNUC__
dst[i] = __builtin_bswap16(src[i]);
#else
dst[i] = (src[i] << 8) | (src[i] >> 8);
#endif
}
}



} // namespace node

#endif // SRC_UTIL_INL_H_
2 changes: 2 additions & 0 deletions src/util.h
Original file line number Diff line number Diff line change
Expand Up @@ -176,6 +176,8 @@ inline void ClearWrap(v8::Local<v8::Object> object);
template <typename TypeName>
inline TypeName* Unwrap(v8::Local<v8::Object> object);

inline void SwapBytes(uint16_t* dst, const uint16_t* src, size_t buflen);

class Utf8Value {
public:
explicit Utf8Value(v8::Isolate* isolate, v8::Local<v8::Value> value);
Expand Down