Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

testscript: "signal: killed" exec errors on MacOS 12 #200

Closed
mvdan opened this issue Feb 8, 2023 · 29 comments · Fixed by #222
Closed

testscript: "signal: killed" exec errors on MacOS 12 #200

mvdan opened this issue Feb 8, 2023 · 29 comments · Fixed by #222
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@mvdan
Copy link
Collaborator

mvdan commented Feb 8, 2023

     --- FAIL: TestScript/flags (0.11s)
        testscript.go:429: > exec shfmt -h
            [signal: killed]
            FAIL: testdata/script/flags.txtar:1: unexpected command failure

I've seen this in a number of projects of mine, like:

@rvagg mentions the same crash in ipld/go-car#364, and in the past, others like @mr-joshcrane have mentioned the same error on Slack.

This must be something going wrong with either testscript or Go, because for example, that TestScript/flags test from above was just running exec shfmt -h, showing the help output from a Go program. You can see that the testscript file is rather boring, so it's not doing anything particularly worrying.

Personally, I've worked around this by downgrading from macos-latest on GitHub Actions (which switched to macos-12 late last year) to macos-11, which seems to make the failures go away entirely. But of course that's not a complete fix.

I first hoped that this would be fixed in Go 1.20 with https://go-review.googlesource.com/c/go/+/460476, and that may still be true, given that there are four distinct os/exec bugs for Mac there. But it's just a good guess, I haven't verified this yet - nor do I have a Mac machine to test with. Help would be appreciated.

The only other recent mentions of "signal: killed" upstream for Mac are golang/go#57418 and golang/go#57239, and they both seem to point to processes being OOM-killed by the system. This could be the case for us as well, perhaps either due to the OS version upgrade changing the OOM behavior, or perhaps because the macos-12 GitHub machines have less available memory. But I'd also find it hard to believe, given that testscript doesn't use a particularly high amount of memory.

Filing this issue to track investigation and progress.

@mvdan mvdan added bug Something isn't working help wanted Extra attention is needed labels Feb 8, 2023
@bitfield
Copy link
Contributor

bitfield commented Feb 8, 2023

Just as a data point, I've never managed to reproduce this on macOS 10, despite many attempts.

@mvdan
Copy link
Collaborator Author

mvdan commented Feb 8, 2023

I've never seen this error on macos-10 nor macos-11 on GitHub Actions despite having used testscript for years, only on its macos-12, so I'm pretty sure that this bug only happens on MacOS 12.

@mvdan
Copy link
Collaborator Author

mvdan commented Feb 8, 2023

Just bit me in the last PR as well: https://github.com/rogpeppe/go-internal/actions/runs/4125273301/jobs/7125604762

--- FAIL: TestSimple (0.05s)
    --- FAIL: TestSimple/cover (3.25s)
        testscript.go:524: # The module uses testscript itself.
            # Use the checked out module, based on where the test binary ran. (0.692s)
            # First, a 'go test' run without coverage. (1.616s)
            # Then, a 'go test' run with -coverprofile.
            # The total coverage after merging profiles should end up being 100%.
            # Marking all printlns as covered requires all edge cases to work well.
            # Go 1.20 learned to produce and merge multiple coverage profiles,
            # so versions before then report a shallow 0% coverage. (0.920s)
            > go test -vet=off -coverprofile=cover.out -v
            [stdout]
            === RUN   TestFoo
                foo_test.go:22: failed to determine release tags from go command: signal: killed
                    
            --- FAIL: TestFoo (0.00s)
            FAIL
            coverage: 0.0% of statements
            exit status 1
            FAIL	test	0.147s
            
            [exit status 1]
            FAIL: testdata/cover.txt:18: unexpected go command failure
            
FAIL
FAIL	github.com/rogpeppe/go-internal/gotooltest	3.594s

@ldemailly
Copy link

I see this as well

@bep
Copy link
Contributor

bep commented Mar 1, 2023

I see this as well, but on Ubuntu (ubuntu-latest) with Go 1.20. I needed to remove Go 1.20 from the build matrix of one project ... I will investigate later.

gohugoio/hugoreleaser#38

@mvdan
Copy link
Collaborator Author

mvdan commented Mar 1, 2023

@bep perhaps I'm missing something, but I can't find a single mention of either "signal" or "killed" in those logs. The failures seem pretty normal - some commands inside your testscripts are failing when the script expects them to succeed.

@bep
Copy link
Contributor

bep commented Mar 1, 2023

@mvdan you're right about that, but all the tests (which uses the test script package in this repo) fails with FAIL: testscripts/misc/flags_in_env.txt:1: unexpected command failure (which I also see in the first comment in this issue, which is why I thought there would be a connection). Anyhow: For me, all the tests fail for me on Go 1.20 on Ubuntu (on GitHub) for some reason that seem to come from testscript.Run and not my tests. It sometimes fails with the same error message on my local MacBook with Go 1.20, too (MacOS 13), but it's mostly OK. On Go 1.19 it is all green. I will do some more checking later ...

@mvdan
Copy link
Collaborator Author

mvdan commented Mar 1, 2023

"unexpected command failure" is one of the relatively normal error messages you will get with testscript when the tests fail. This particular bug only happens on MacOS 12, and noone has been able to reproduce it on Linux that I am aware of, so it seems very unlikely that you're running into the same bug.

@bep
Copy link
Contributor

bep commented Mar 2, 2023

"unexpected command failure" is one of the relatively normal error messages you will get with testscript when the tests fail.

As I said, I'm pretty sure these unexpected command failure are ... unexpected and that they come from flakiness in the test infrastructure that's become more flaky in Go 1.20. But you're right, It's probably a different flakines issue; I will create a different issue somewhere once I figure out a little better who to blaim.

mvdan added a commit to mvdan/xurls that referenced this issue Mar 12, 2023
mvdan added a commit to mvdan/go-internal that referenced this issue Mar 22, 2023
Per rogpeppe#200, macos-12 can cause sporadic `signal: killed` testscript
failures, and we have started seeing them in some jobs within
go-internal itself as well. Downgrade to macos-11 for now,
like we've done in other projects, as we still don't know the cause.

Also drop test-gotip; we haven't been keeping it up to date for a while
now, so it's clearly not needed at the moment. If we want to ensure that
go-internal works on new major versions of Go before they are released,
using the beta or RC releases seems like a better and easier approach.
mvdan added a commit that referenced this issue Mar 27, 2023
Per #200, macos-12 can cause sporadic `signal: killed` testscript
failures, and we have started seeing them in some jobs within
go-internal itself as well. Downgrade to macos-11 for now,
like we've done in other projects, as we still don't know the cause.

Also drop test-gotip; we haven't been keeping it up to date for a while
now, so it's clearly not needed at the moment. If we want to ensure that
go-internal works on new major versions of Go before they are released,
using the beta or RC releases seems like a better and easier approach.
@mvdan
Copy link
Collaborator Author

mvdan commented Mar 27, 2023

@rogpeppe has been seeing the same failures now on macos-11, in PRs to this very repo. In particular, the OS version appears to be 11.7.4, and the failures happen on both Go 1.19.x and 1.20.x.

Perhaps whatever changed in macos-12 to trigger this bug was backported to macos-11 now.

@ldemailly
Copy link

I guess (possibly wrong) it's a background version of what happens when you run a new binary:

Screen Shot 2023-03-30 at 2 46 42 PM

followed by "Killed: 9" when clicking ok

@bitfield
Copy link
Contributor

@ldemailly, that sounds like a reasonable idea, but if that's the case, how come it doesn't happen every time, and how come it doesn't happen when you run go test? I mean, that is building and executing a binary in the background. All testscript does is re-exec the same binary, so I can't work out why testscript commands appear to fail frequently (but apparently randomly) on all modern macOS versions, while go test never does.

bep added a commit to bep/go-internal that referenced this issue May 2, 2023
By adding a small sleep before `TestingM.Run()` to allow the write of the test commands to be flushed to disk.

Fixes rogpeppe#200
@bep
Copy link
Contributor

bep commented May 2, 2023

This has become a real issue for me, so I decided to take a look at it, and I found that there's a correlation between the size/amount of commands added to RunMain and this issue. I added a patch in the PR below that fixes this on my MacBook, a sleep after the commands gets written to disk (which I assume allow the OS to flush) and before the test execution.

See #219

bep added a commit to bep/go-internal that referenced this issue May 2, 2023
By adding a small sleep before `TestingM.Run()` to allow the write of the test commands to be flushed to disk.

Fixes rogpeppe#200
bep added a commit to bep/go-internal that referenced this issue May 2, 2023
By adding a small sleep before `TestingM.Run()` to allow the write of the test commands to be flushed to disk.

Fixes rogpeppe#200
bep added a commit to bep/go-internal that referenced this issue May 3, 2023
On `MacOS` there are lots of reports of unexpected failing tests with output similar to this:

```
 [signal: killed]
            FAIL: testscripts/myecho.txt:1: unexpected command failure
```

On CI builds the workaround has been to downgrade to a builder with MacOS <= 11 (e.g. macos-11 on GitHub).

In development on a MacBook, this is not an option.

This commit works around what seem to be a upstream bug in `os.Link` by adding a small sleep before `TestingM.Run()` to allow the write of the test
command symlinks to be ready.

See rogpeppe#200
bep added a commit to bep/go-internal that referenced this issue May 3, 2023
On `MacOS` there are lots of reports of unexpected failing tests with output similar to this:

```
 [signal: killed]
            FAIL: testscripts/myecho.txt:1: unexpected command failure
```

On CI builds the workaround has been to downgrade to a builder with MacOS <= 11 (e.g. macos-11 on GitHub).

In development on a MacBook, this is not an option.

This commit works around what seem to be a upstream bug in `os.Link` by adding a small sleep before `TestingM.Run()` to allow the write of the test
command symlinks to be ready.

See rogpeppe#200
bep added a commit to bep/go-internal that referenced this issue May 3, 2023
On `MacOS` there are lots of reports of unexpected failing tests with output similar to this:

```
 [signal: killed]
            FAIL: testscripts/myecho.txt:1: unexpected command failure
```

On CI builds the workaround has been to downgrade to a builder with MacOS <= 11 (e.g. macos-11 on GitHub).

In development on a MacBook, this is not an option.

This commit works around what seem to be a upstream bug in `os.Link` by adding a small sleep before `TestingM.Run()` to allow the write of the test
command symlinks to be ready.

See rogpeppe#200
bep added a commit to bep/go-internal that referenced this issue May 3, 2023
On `MacOS` there are lots of reports of unexpected failing tests with output similar to this:

```
 [signal: killed]
            FAIL: testscripts/myecho.txt:1: unexpected command failure
```

On CI builds the workaround has been to downgrade to a builder with MacOS <= 11 (e.g. macos-11 on GitHub).

In development on a MacBook, this is not an option.

This commit works around what seem to be a upstream bug in `os.Link` by adding a small sleep before `TestingM.Run()` to allow the write of the test
command hard links to be ready.

See rogpeppe#200
bep added a commit to bep/go-internal that referenced this issue May 3, 2023
By doing a full copy and not a hard link of the binaries.

This is the fall back used already for Windows.

This is tested OK to remove unexpected test failures with error output similar to:

```
[signal: killed]
FAIL: testscripts/myecho.txt:1: unexpected command failure
```

See rogpeppe#200
@mvdan
Copy link
Collaborator Author

mvdan commented May 5, 2023

It's clonefile, the one you linked to above: https://cs.opensource.google/go/x/sys/+/refs/tags/v0.8.0:unix/zsyscall_darwin_amd64.go;l=1011

Where do you see clone2? I only see clonefile(2), but that is just the rather confusing man page syntax to say that clonefile is in the category 2 System calls (functions provided by the kernel), per man man.

@bep
Copy link
Contributor

bep commented May 5, 2023

So, this works:

if runtime.GOOS != "windows" {
	if runtime.GOOS == "darwin" {
		if err := unix.Clonefile(from, to, 0); err == nil {
			return nil
		}
	} else {
		if err := os.Link(from, to); err == nil {
			return nil
		}
	}
}

But it seem to create a copy of the binary1. Or, according to the docs, it shouldn't...

Footnotes

  1. I don't worry too much about that anymore, but it may be an issue in the wild. I have figured out that I can move most of my utility stuff into testscript.Params.Cmds which I assume gets compiled into the test binary.

@mvdan
Copy link
Collaborator Author

mvdan commented May 5, 2023

Full copy on Windows, Clonefile on Mac, and hard links on Linux sound good to me. Beware that we likely need build tags now, since unix.Clonefile is only defined for GOOS=darwin. Perhaps we can add a cloneFile func which is os.Link on unix && !darwin, unix.CloneFile on darwin, and return fmt.Errorf("unavailable") on !unix.

testscript.Params.Cmds don't run as separate programs and processes via os/exec, instead running in-process via a function call. That makes them a lot cheaper indeed. Less powerful as well, but for small test helpers and utilities, I find that they are often enough. In medium-to-large projects, I often end up with 1-3 TestMain commands, often to run the "real" main program in different modes, and 5-15 testscript.Params commands.

@bep
Copy link
Contributor

bep commented May 5, 2023

Full copy on Windows, Clonefile on Mac, and hard links on Linux sound good to me. Beware that we likely need build tags now, since unix.Clonefile is only defined for GOOS=darwin. Perhaps we can add a cloneFile func which is os.Link on unix && !darwin, unix.CloneFile on darwin, and return fmt.Errorf("unavailable") on !unix.

I can prepare a PR with something ala the above in a few hours.

bep added a commit to bep/go-internal that referenced this issue May 5, 2023
To fix unexpected errors of type:

```
[signal: killed]
FAIL: testscripts/myecho.txt:1: unexpected command failure
```

Fixes rogpeppe#200
bep added a commit to bep/go-internal that referenced this issue May 5, 2023
To fix unexpected errors of type:

```
[signal: killed]
FAIL: testscripts/myecho.txt:1: unexpected command failure
```

Fixes rogpeppe#200
bep added a commit to bep/go-internal that referenced this issue May 5, 2023
To fix unexpected errors of type:

```
[signal: killed]
FAIL: testscripts/myecho.txt:1: unexpected command failure
```

Fixes rogpeppe#200
bep added a commit to bep/go-internal that referenced this issue May 5, 2023
To fix unexpected errors of type:

```
[signal: killed]
FAIL: testscripts/myecho.txt:1: unexpected command failure
```

Fixes rogpeppe#200
bep added a commit to bep/go-internal that referenced this issue May 5, 2023
To fix unexpected errors of type:

```
[signal: killed]
FAIL: testscripts/myecho.txt:1: unexpected command failure
```

Fixes rogpeppe#200
bep added a commit to bep/go-internal that referenced this issue May 5, 2023
To fix unexpected errors of type:

```
[signal: killed]
FAIL: testscripts/myecho.txt:1: unexpected command failure
```

Fixes rogpeppe#200
@mvdan mvdan closed this as completed in #222 May 5, 2023
mvdan pushed a commit that referenced this issue May 5, 2023
To fix unexpected errors of type:

```
[signal: killed]
FAIL: testscripts/myecho.txt:1: unexpected command failure
```

Fixes #200
@myitcv
Copy link
Collaborator

myitcv commented May 6, 2023

@bep - thanks very much for digging in here to get this fixed.

@ldemailly
Copy link

it's fantastic to have a solution, I'm just a bit confused as to why symlink isn't even better/simpler?

@mvdan
Copy link
Collaborator Author

mvdan commented May 6, 2023

@ldemailly see the previous comments, particularly #200 (comment).

mvdan added a commit to mvdan/go-internal that referenced this issue May 6, 2023
Now that it seems we found a fix to rogpeppe#200, there is no reason to stick
to macos-11, which will likely be deprecated soon.

Update actions/setup-go to its latest version as well.
The new version uses caching by default, which we do not need.

While here, tidy up the cloneFile docs a bit.
mvdan added a commit that referenced this issue May 6, 2023
Now that it seems we found a fix to #200, there is no reason to stick
to macos-11, which will likely be deprecated soon.

Update actions/setup-go to its latest version as well.
The new version uses caching by default, which we do not need.

While here, tidy up the cloneFile docs a bit.
@ldemailly
Copy link

oic "[tools like go] will use the symlink target"

that seems like a bug though if they cause the args0 to change as it's fairly common to rely on that (also that mr did pass all tests so presumably it's actually working ?)

what's the size of the directory when using clonefile?

@ldemailly
Copy link

oic, neat, from the man page:

 The cloned file dst shares its data blocks with the src file [...]
 
 Subsequent writes to either the original or cloned file are private to
 the file being modified (copy-on-write).

except for EXDEV possible issue (are we sure we stay on same FS?)

@mvdan
Copy link
Collaborator Author

mvdan commented May 7, 2023

There is no guarantee that the source and destination are the same filesystem. Which is why there is a fallback with regular file copying. However, they will often be the same filesystem, as go test and testing both place temporary files under os.TempDir.

ldemailly pushed a commit to fortio/testscript that referenced this issue May 7, 2023
To fix unexpected errors of type:

```
[signal: killed]
FAIL: testscripts/myecho.txt:1: unexpected command failure
```

Fixes rogpeppe#200
ldemailly pushed a commit to fortio/testscript that referenced this issue May 7, 2023
Now that it seems we found a fix to rogpeppe#200, there is no reason to stick
to macos-11, which will likely be deprecated soon.

Update actions/setup-go to its latest version as well.
The new version uses caching by default, which we do not need.

While here, tidy up the cloneFile docs a bit.
@ldemailly
Copy link

Thanks for the explanation (that there is a fallback already).

So a symlink, in theory, if not for that (possibly outdated?) go build -toolexec issue would be always working (and smallest disk use).

Either way I think the current solution is great and matches the previous hardlink behavior (fs wise), so it's awesome and indeed I don't see "killed" anymore on my mac 🎉

bep pushed a commit to bep/go-internal that referenced this issue May 13, 2023
Now that it seems we found a fix to rogpeppe#200, there is no reason to stick
to macos-11, which will likely be deprecated soon.

Update actions/setup-go to its latest version as well.
The new version uses caching by default, which we do not need.

While here, tidy up the cloneFile docs a bit.
ldemailly pushed a commit to fortio/testscript that referenced this issue Sep 6, 2024
To fix unexpected errors of type:

```
[signal: killed]
FAIL: testscripts/myecho.txt:1: unexpected command failure
```

Fixes rogpeppe#200
ldemailly pushed a commit to fortio/testscript that referenced this issue Sep 6, 2024
Now that it seems we found a fix to rogpeppe#200, there is no reason to stick
to macos-11, which will likely be deprecated soon.

Update actions/setup-go to its latest version as well.
The new version uses caching by default, which we do not need.

While here, tidy up the cloneFile docs a bit.
ldemailly pushed a commit to fortio/testscript that referenced this issue Sep 6, 2024
To fix unexpected errors of type:

```
[signal: killed]
FAIL: testscripts/myecho.txt:1: unexpected command failure
```

Fixes rogpeppe#200
ldemailly pushed a commit to fortio/testscript that referenced this issue Sep 6, 2024
Now that it seems we found a fix to rogpeppe#200, there is no reason to stick
to macos-11, which will likely be deprecated soon.

Update actions/setup-go to its latest version as well.
The new version uses caching by default, which we do not need.

While here, tidy up the cloneFile docs a bit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed
Projects
None yet
5 participants