
v8 GC crash on node 4.4.7 #7654

Closed
jeroenvollenbrock opened this issue Jul 11, 2016 · 12 comments
Labels
v8 engine: Issues and PRs related to the V8 dependency.

Comments

@jeroenvollenbrock

jeroenvollenbrock commented Jul 11, 2016

  • Version: v4.4.7 (nodesource)
  • Platform: Linux 4.1.13-HomeyOS-0.8.0 #1 SMP Sun Dec 20 00:02:18 UTC 2015 armv7l GNU/Linux
  • OS: Debian Jessie based linux
  • Subsystem: V8 Garbage Collection

node sometimes crashes with a V8 Fatal error after running for about 5 hours:

Fatal error in ../deps/v8/src/heap/mark-compact.cc, line 3088

Check failed: large_object->IsHeapObject().

Stack Trace:

#0  0x00a18460 in v8::base::OS::Abort ()
#1  0x00a15938 in V8_Fatal ()
#2  0x005e69b4 in v8::internal::MarkCompactCollector::IsSlotInBlackObject ()
#3  0x005e8324 in v8::internal::SlotsBuffer::RemoveInvalidSlots ()
#4  0x005e84b8 in v8::internal::MarkCompactCollector::ClearInvalidStoreAndSlotsBufferEntries ()
#5  0x005f4abc in v8::internal::MarkCompactCollector::CollectGarbage ()
#6  0x005b19a4 in v8::internal::Heap::MarkCompact ()
#7  0x005c89f0 in v8::internal::Heap::PerformGarbageCollection ()

Registers at #2:

r0             0x0      0
r1             0x0      0
r2             0xad071c 11339548
r3             0xafb7dc 11515868
r4             0x0      0
r5             0x22ff088        36696200
r6             0x2301e4c        36707916
r7             0x10     16
r8             0x3fd    1021
r9             0x7ed1843c       2127660092
r10            0x1fd56f8        33380088
r11            0x7ed18424       2127660068
r12            0xf81ab4 16259764
sp             0x7ed183f8       0x7ed183f8
lr             0x5e69b4 6187444
pc             0x5e69b4 0x5e69b4 <_ZN2v88internal20MarkCompactCollector19IsSlotInBlackObjectEPNS0_4PageEPhPPNS0_10HeapObjectE+604>
cpsr           0x200f0010       537853968

I've made an attempt to debug this myself. It looks like the IsHeapObject() check at line 3088 fails because r4 is 0b00 where 0b01 was expected. r4 holds the AND of r0 and 0b11, and r0 holds the return value of v8::internal::LargeObjectSpace::FindObject(Address a). Since r0 is 0x0, this method apparently did not find the specified object in the large object space. Looking at the call stack, I can see this function was invoked through the ClearInvalidStoreAndSlotsBufferEntries call, which makes me wonder whether the object should exist in the LargeObjectSpace at all.
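For illustration, a minimal, self-contained sketch (my own stand-in, not the V8 sources) of the tag check that fails here; V8 marks heap-object pointers by setting their low two bits to 01, which is exactly the 0b01 pattern missing from r4:

#include <cassert>
#include <cstdint>

// Stand-ins for V8's pointer-tagging constants: heap-object pointers carry
// the tag 0b01 in their low two bits, so IsHeapObject() amounts to
// (raw & 0b11) == 0b01.
constexpr uintptr_t kHeapObjectTagMask = 0b11;
constexpr uintptr_t kHeapObjectTag = 0b01;

bool LooksLikeHeapObject(uintptr_t raw) {
  return (raw & kHeapObjectTagMask) == kHeapObjectTag;
}

int main() {
  // FindObject() returned 0 (r0 == 0x0), so r0 & 0b11 is 0b00 rather than
  // the expected 0b01 and the CHECK aborts the process.
  uintptr_t find_object_result = 0x0;
  assert(!LooksLikeHeapObject(find_object_result));
  return 0;
}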

Any help debugging the cause of this issue is appreciated. A coredump and remote debugging tools are available if required.

@Fishrock123 added the v8 engine and v4.x labels on Jul 11, 2016
@bnoordhuis
Member

Are you using native modules? Memory corruption in add-ons sometimes causes GC crashes.

Something to try is compiling from source (./configure && make -C out BUILDTYPE=Debug) and running with --verify_heap. That will hopefully catch the bug closer to the source.

It's possible you are hitting a bug that was fixed in v8/v8@01b8fc8; --verify_heap should catch it.

Looking at the call stack, I can see this function was invoked through the ClearInvalidStoreAndSlotsBufferEntries call, which makes me wonder whether the object should exist in the LargeObjectSpace at all.

If your question is whether it's legal for LO objects to show up in the store buffer - yes, it is.

@jeroenvollenbrock
Author

Hi Ben, I am indeed using a few native modules. I've been trying to systematically disable them in order to pinpoint the offending module, but that has not helped so far; the crash keeps occurring. I'll try running a debug build, but this might take a while since it takes a few hours to reproduce the problem.

If your question is whether it's legal for LO objects to show up in the store buffer - yes, it is.

Actually, I meant something else. The name ClearInvalidStoreAndSlotsBufferEntries implies the method is trying to clear invalid store and slot buffer entries. During this operation it calls IsInvalidSlot, which calls IsSlotInLiveObject, which in turn calls IsSlotInBlackObject. However, IsSlotInBlackObject seems to assume the slot is a valid address by using the CHECK macro on the found object. If we are really talking about invalid slots, what assures us that the large object space (still) has a valid HeapObject located at this slot?
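To make the chain concrete, here is a minimal, self-contained sketch (stand-in types and simplified control flow, not the actual node 4.4.7 / V8 code) of the behaviour being described:

#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Stand-ins, not V8 types: just enough structure to show why the CHECK in
// IsSlotInBlackObject can fire while "removing invalid slots".
using Address = uintptr_t;
struct HeapObject {};

struct LargeObjectSpace {
  std::unordered_map<Address, HeapObject*> pages_;  // page base -> object
  HeapObject* FindObject(Address a) {
    auto it = pages_.find(a & ~uintptr_t{4095});
    return it == pages_.end() ? nullptr : it->second;  // nullptr mirrors r0 == 0x0
  }
};

bool IsSlotInBlackObject(LargeObjectSpace& lo_space, Address slot) {
  HeapObject* large_object = lo_space.FindObject(slot);
  // The real code does CHECK(large_object->IsHeapObject()) here: if the
  // lookup failed, the process aborts instead of treating the slot as invalid.
  assert(large_object != nullptr && "Check failed: large_object->IsHeapObject().");
  return true;
}

bool IsSlotInLiveObject(LargeObjectSpace& lo, Address slot) { return IsSlotInBlackObject(lo, slot); }
bool IsInvalidSlot(LargeObjectSpace& lo, Address slot) { return !IsSlotInLiveObject(lo, slot); }

int main() {
  LargeObjectSpace lo_space;                            // the page is no longer tracked here
  std::vector<Address> recorded_slots = {0x22ff088};    // example stale slot
  for (Address slot : recorded_slots) {
    IsInvalidSlot(lo_space, slot);                      // aborts before the slot can be dropped
  }
  return 0;
}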

@bnoordhuis
Member

Sounds plausible. The only scenario I can think of where that could happen is when a LO page is unmapped right before iterating over the store buffer. There is some special handling around LO page freeing in LargeObjectSpace::FreeUnmarkedObjects() and Heap::FreeQueuedChunks() that might be buggy.

If that's the case, when running with --log, I think you should see a delete,MemoryChunk,$addr in the log file right before the crash, where addr corresponds to the page address of the offending object (i.e., the object address rounded down to a multiple of 4096).
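For reference, a minimal sketch of that rounding, using the r5 value from the register dump above purely as an example address:

#include <cstdint>
#include <cstdio>

int main() {
  uintptr_t object_addr = 0x22ff088;                     // example address only
  uintptr_t page_addr = object_addr & ~uintptr_t{4095};  // round down to a multiple of 4096
  // The log line to search for would then look like: delete,MemoryChunk,0x22ff000
  std::printf("delete,MemoryChunk,0x%lx\n", static_cast<unsigned long>(page_addr));
  return 0;
}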

@jeroenvollenbrock
Author

jeroenvollenbrock commented Jul 14, 2016

@bnoordhuis: I've been running a debug build for the past three days, but I have not been able to reproduce the crash with these builds yet. I've added another test setup with a release build of Node.js compiled with the v8_enable_verify_heap gyp flag and ran it with the --verify_heap argument, but this build is not crashing either. I've set up other devices to run a debug build and a (normal) release build that loop through the unit tests of all individual native modules, but these setups have not crashed once either. In the meantime, I've received bug reports mentioning very irregular intervals, so I suspect a race condition is involved (homey/#690).

Do you have any other ideas?

@ofrobots
Contributor

/cc @nodejs/v8

@ofrobots
Contributor

@jeroenvollenbrock Are you able to run this with newer versions of Node to check if the problem still exists?

Looking at V8 ToT, the expectation seems to be that IsSlotInBlackObject is not supposed to be reached for LO space objects:

bool MarkCompactCollector::IsSlotInBlackObject(MemoryChunk* p, Address slot) {
  Space* owner = p->owner();
  DCHECK(owner != heap_->lo_space() && owner != nullptr);
  USE(owner);

It would be good to know if Node 5.x and 6.x still have the crash. I suspect it might have been fixed as part of the store buffer changes that happened earlier this year.

@jeroenvollenbrock
Author

jeroenvollenbrock commented Jul 15, 2016

@ofrobots I'll deploy Node.js 6.3.0 later today, but I'm not completely sure how much time it is going to take to make all dependencies compatible. If I succeed, I'll allocate a couple of test devices and let them run over the weekend, but I doubt it will yield any results before next Monday.

@jeroenvollenbrock
Author

@ofrobots I've been running a stress test using Node 6 for about 8 days, no crashes so far. I did have to update one native module (node-serialport), and I started a second test using the newer version of this module on Node 4, which did not crash either. Since I'm getting close to the end of my merge window, I've updated the module to its newest version, implemented a watchdog, and configured it to upload a coredump and restart Node upon a segfault. This will hit our beta channel later this week (~1000 users), which should confirm whether or not this crash is really gone.

@hannespayer
Contributor

v4.4.7 (nodesource) indeed has the old remembered set implementation. There we also had to check for large objects when clearing slots.

A race between unmapping and clearing should not happen, since clearing happens before sweeping, i.e. before memory gets released. I am wondering if you are flushing out a bug in LargeObjectSpace::FindPage. You could print the whole chunk_map_ map before crashing; maybe that gives us some information.
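A minimal, self-contained sketch of such a dump, using a std::map as a stand-in for the real chunk_map_ (the actual type and iteration API inside V8 4.x may differ):

#include <cstdint>
#include <cstdio>
#include <map>

using Address = uintptr_t;

// Stand-in for the LO-space page map: page base address -> chunk pointer.
std::map<Address, const void*> chunk_map_;

// Print every entry so a missing page for the crashing slot becomes visible.
void DumpChunkMap() {
  for (const auto& entry : chunk_map_) {
    std::printf("chunk_map_: page=0x%lx chunk=%p\n",
                static_cast<unsigned long>(entry.first), entry.second);
  }
}

int main() {
  chunk_map_[0x22ff000] = &chunk_map_;  // dummy entry for demonstration
  DumpChunkMap();
  return 0;
}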

@jeroenvollenbrock
Author

I've replaced the native module node-serialport with my own implementation, and we've now been running on 4.4.7 for approximately 500,000 device-hours without re-triggering the crash. I think it's therefore safe to assume the issue was caused by node-serialport.

@MylesBorins
Contributor

/cc @reconbot

@reconbot
Contributor

reconbot commented Sep 6, 2016

@jeroenvollenbrock I'm the maintainer of node-serialport; can you share a bit more about what you found? I'd love to have it fixed ASAP.
