Memory leaks (wasmtime vs wasmer) #2

Closed
jfmc opened this issue Aug 7, 2022 · 18 comments · Fixed by #7

@jfmc

jfmc commented Aug 7, 2022

Would wasmtime provide any advantage over wasmer?

@guregu
Collaborator

guregu commented Aug 7, 2022

There's one blocker for using wasmtime: bytecodealliance/wasmtime-go#34
We need an easy way to capture stdout to receive results from the Prolog interpreter.
We could work around it using files but it's not ideal.

wasmer solves this with a little C shim: https://github.com/wasmerio/wasmer-go/blob/b4462e6583f8d7b964e32bdd8d065cf96fba6c08/wasmer/wasi.go#L18-L54
So it probably wouldn't be too hard to add this to wasmtime.
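For reference, the Go-side API backed by that shim looks roughly like the sketch below. This is a rough example and not our actual code; the tpl.wasm path and the -q argument are just placeholders:

package main

import (
	"fmt"
	"log"
	"os"

	wasmer "github.com/wasmerio/wasmer-go/wasmer"
)

func main() {
	// Placeholder path to a WASI build of Trealla.
	wasmBytes, err := os.ReadFile("tpl.wasm")
	if err != nil {
		log.Fatal(err)
	}

	engine := wasmer.NewEngine()
	store := wasmer.NewStore(engine)
	module, err := wasmer.NewModule(store, wasmBytes)
	if err != nil {
		log.Fatal(err)
	}

	// CaptureStdout routes the module's stdout through wasmer's C shim so it
	// can be read back from Go instead of going to the process's stdout.
	wasiEnv, err := wasmer.NewWasiStateBuilder("tpl").
		Argument("-q").
		CaptureStdout().
		Finalize()
	if err != nil {
		log.Fatal(err)
	}

	importObject, err := wasiEnv.GenerateImportObject(store, module)
	if err != nil {
		log.Fatal(err)
	}
	instance, err := wasmer.NewInstance(module, importObject)
	if err != nil {
		log.Fatal(err)
	}

	start, err := instance.Exports.GetWasiStartFunction()
	if err != nil {
		log.Fatal(err)
	}
	start()

	// Whatever the interpreter printed is now available to Go.
	fmt.Println(string(wasiEnv.ReadStdout()))
}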

I think we could also export Go functions to WASM and have Trealla call them via FFI to report results, but I'm not quite sure how that works or whether it's possible yet.

I would be especially interested in switching to any WASM runtime with good Windows support. Then we can support all major platforms out of the box.

@guregu
Collaborator

guregu commented Sep 1, 2022

This looks promising as well: https://github.com/tetratelabs/wazero
It's more in the spirit of this library, I think. Not sure if its performance is anywhere close to the others'.
wazero struggles with my broken realpath implementation in TPL, so I need to fix that and do some experimenting.
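If wazero pans out, stdout capture is built into its module config. Here's a rough sketch (the tpl.wasm path and arguments are placeholders, not how this library is currently wired up):

package main

import (
	"bytes"
	"context"
	"fmt"
	"log"
	"os"

	"github.com/tetratelabs/wazero"
	"github.com/tetratelabs/wazero/imports/wasi_snapshot_preview1"
)

func main() {
	ctx := context.Background()

	// Placeholder path to a WASI build of Trealla.
	wasm, err := os.ReadFile("tpl.wasm")
	if err != nil {
		log.Fatal(err)
	}

	r := wazero.NewRuntime(ctx)
	defer r.Close(ctx)
	wasi_snapshot_preview1.MustInstantiate(ctx, r)

	// Stdout is just an io.Writer here, so capture is a plain bytes.Buffer.
	var stdout bytes.Buffer
	cfg := wazero.NewModuleConfig().
		WithStdout(&stdout).
		WithArgs("tpl", "-q")

	compiled, err := r.CompileModule(ctx, wasm)
	if err != nil {
		log.Fatal(err)
	}
	// For a WASI command module, _start runs during instantiation.
	if _, err := r.InstantiateModule(ctx, compiled, cfg); err != nil {
		log.Fatal(err)
	}

	fmt.Println(stdout.String())
}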

@enoperm

enoperm commented Apr 20, 2023

Sorry for "hijacking" this issue, though I do believe what I have observed is not entirely unrelated to the current state of wasmer-go.

I have played around with the library and really like the concept, though I have run into some issues with memory usage.

As an example, the code below demonstrates three separate issues:

  1. Using a single engine instance for a large number of queries causes memory usage to climb to around 7 GiB or so, at which point all subsequent queries get stuck. I suspect it has to do with the allocator, though I have not verified it. I'm pretty sure a leak is involved somewhere, because I do not think any state should be retained from just querying for nl.
  2. Using a single engine instance for a large number of queries that involve host interop causes memory usage to climb to around 2 GiB, at which point the interop-related code appears to use an int32 to slice into WASM-allocated memory, resulting in a panic:
$ go run main.go -- --with-interop
2023/04/20 14:59:30
2023/04/20 15:00:53 trealla: query error: trealla: panic: runtime error: slice bounds out of range [-2130246640:]
exit status 1
  3. The reason I brought this up in this issue lies with this leak. In theory, it is not a problem if a WASM application leaks memory, since it is very cheap to just throw the instance away and allocate one with a fresh slate... except doing so consumes memory without bound. The runtime does not seem to ever release the instantiated WASM modules/memory, and the result is a far larger and faster memory leak than just using a persistent instance.
package main

import (
	"context"
	"fmt"
	"log"
	"math/rand"
	"os"
	"time"

	"github.com/trealla-prolog/go/trealla"
)

func newEngine(ctx context.Context) (trealla.Prolog, error) {
	engine, err := trealla.New()
	if err != nil {
		return nil, err
	}

	// Register a host predicate so --with-interop queries cross the Go/WASM boundary.
	err = engine.Register(ctx, "invoke", 1, trealla.Predicate(func(pl trealla.Prolog, subquery trealla.Subquery, goal trealla.Term) trealla.Term {
		v := fmt.Sprintf("%v", rand.Float64())
		return trealla.Compound{
			Functor: "invoke",
			Args: []trealla.Term{
				trealla.Atom(v),
			},
		}
	}))
	return engine, err
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	engine, err := newEngine(ctx)
	if err != nil {
		log.Fatal(err)
	}

	log.Print() // timestamp marking the start of the run
	defer func() {
		// On panic (e.g. the interop slice error), log a timestamp and exit
		// so the time until failure is visible.
		recover()
		log.Print()
		os.Exit(2)
	}()

	getEngine := func() trealla.Prolog {
		return engine
	}

	// perhaps ugly, but suffices for testing
	query := `nl`
	for _, arg := range os.Args[1:] {
		switch arg {
		case "--with-interop":
			query = `invoke(A)`

		case "--with-new-instance":
			getEngine = func() trealla.Prolog {
				engine, _ := newEngine(ctx)
				return engine
			}
		}
	}

	// Run queries forever; with --with-new-instance each iteration gets a fresh engine.
	for {
		func() {
			ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
			defer cancel()
			engine := getEngine()
			doWork(ctx, query, engine)
		}()
	}
}

func doWork(ctx context.Context, query string, engine trealla.Prolog) {
	resultSet := engine.Query(ctx, query)
	defer resultSet.Close()

	for resultSet.Next(ctx) {
	}

	if err := resultSet.Err(); err != nil {
		log.Println(err)
		os.Exit(1)
	}
}

Since the alternative is getting stuck forever, I'm not entirely sure the int32 issue should be fixed just yet; for one, I would personally prefer crashing and restarting to never getting anywhere at all. But assuming I'm not wrong about the source of the leak when each query is served by a new instance, I think a different WASM runtime might be more beneficial than it seems at first glance.

@guregu
Collaborator

guregu commented Apr 20, 2023

Thanks @enoperm, that is very concerning. I’ll look into it 👀

@enoperm

enoperm commented Apr 20, 2023

Do remember to use a separate cgroup or something if you run the repro code with --with-new-instance. It is a good way to test your OOM killer...

@guregu
Collaborator

guregu commented Apr 20, 2023

I think the leak around instantiating new Prologs is my fault, not wasmer's; it seems I forgot to add a Close method to Prolog. I'll add that and we'll see if it helps.
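As a rough sketch (hypothetical, since that Close method hasn't shipped yet), the --with-new-instance loop from your repro could then release each instance explicitly instead of waiting on finalizers:

	for {
		func() {
			ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
			defer cancel()
			engine := getEngine()
			// Hypothetical Close: free the Wasm instance deterministically
			// rather than relying on the GC to run finalizers.
			defer engine.Close()
			doWork(ctx, query, engine)
		}()
	}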

I think it would also not be too hard to move to wasmtime; Trealla now has some machinery to capture stdout/stderr, so we can work around the wasmtime-go issue.

@guregu
Collaborator

guregu commented Apr 20, 2023

To be more specific, the wasmer objects rely on runtime finalizers to free memory if they aren't explicitly closed, so a tight loop could prevent those finalizers from ever being called. Adding a Close method to the Prolog interface should help with that.
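For context, here is a generic sketch of finalizer-based cleanup in Go (not wasmer-go's actual code): the native memory is only freed after the GC collects the Go wrapper, and a tight loop that allocates little Go memory may never give it the chance.

package main

import "runtime"

// handle stands in for a Go wrapper around memory owned by a native Wasm runtime.
type handle struct{ freed bool }

// free would release the native allocation in the real thing.
func (h *handle) free() { h.freed = true }

func newHandle() *handle {
	h := &handle{}
	// The finalizer runs only after the GC notices h is unreachable, which can
	// be arbitrarily late; the native allocation is invisible to Go's GC pacing,
	// so it piles up in the meantime.
	runtime.SetFinalizer(h, func(h *handle) { h.free() })
	return h
}

// Close frees the native memory deterministically and disarms the finalizer.
func (h *handle) Close() {
	h.free()
	runtime.SetFinalizer(h, nil)
}

func main() {
	h := newHandle()
	defer h.Close()
}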

@enoperm

enoperm commented Apr 20, 2023

I may have been a bit harsh towards wasmer, I suppose. I'm not an expert on the Go runtime, so when I dug into the code and saw that finalizers exist, I just assumed they'd surely be called by the GC before the kernel triggers the OOM killer, and then I encountered the previously linked issue and it kind of fit: if the wasm runtime does not free things up anyway, then any finalizers are pointless.

@guregu
Collaborator

guregu commented Apr 20, 2023

I'm not too sure how Go's runtime handles them either, TBH, so no worries.
I tried adding a Close() method to the Prolog instance that calls wasmer's instance.Close(), but it doesn't seem to help. I'll try to figure out whether it's a problem here or in wasmer-go. I'll also check whether trealla-js has the same issue (I remember measuring it a while back and it was fine), which would rule out Trealla itself.
Thanks again for the test code, it is very helpful.

@guregu
Collaborator

guregu commented Apr 20, 2023

trealla-js looks fine, so it shouldn't be a problem with Trealla itself. I think the issue here is that we hold a global wasmer.Store, which probably just accumulates memory indefinitely. We'd need to instantiate a new module for each Prolog to avoid that, but then we run into the wasmer-go bug you linked. I'll look into using a different library.
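Roughly the direction that implies (hypothetical types, not the current internals): each Prolog owns its own wasmer objects instead of sharing a package-level Store, so closing it can actually release the memory.

package sketch

import wasmer "github.com/wasmerio/wasmer-go/wasmer"

// Hypothetical per-instance layout.
type prolog struct {
	engine   *wasmer.Engine
	store    *wasmer.Store
	module   *wasmer.Module
	instance *wasmer.Instance
}

func newProlog(wasm []byte) (*prolog, error) {
	engine := wasmer.NewEngine()
	store := wasmer.NewStore(engine) // one Store per Prolog, not a global
	module, err := wasmer.NewModule(store, wasm)
	if err != nil {
		return nil, err
	}
	// WASI setup and instantiation elided.
	return &prolog{engine: engine, store: store, module: module}, nil
}

func (p *prolog) Close() {
	if p.instance != nil {
		p.instance.Close() // drops the instance and its linear memory
	}
}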

BTW, the interop bug with the negative pointers should be fixed in the next release, but it can still run out of memory if it hits 4GB. Not sure if any of the Go libraries support 64-bit WASM, but if they do it should be possible to fix that.

@guregu
Collaborator

guregu commented Apr 21, 2023

Good news: I got wasmtime working (in the wasmtime branch, if you want to check it out). Running the test program with --with-new-instance and closing the engine in the loop allows it to properly garbage collect the memory. It also looks like wasmtime properly supports 64-bit Wasm, so I'll play around with compiling Trealla to that this weekend.
Bad news: it seems like there's still a leak with queries against a single Prolog instance. Looks like this is happening on the Wasm side, probably from the stuff I added to get wasmtime to work ;). Should be able to fix it.
Also, the Tak function benchmark seems about twice as slow on wasmtime vs wasmer; not sure what's up with that.

@guregu
Collaborator

guregu commented Apr 21, 2023

Found the leak! Got the default test stable at around 64 MB. Looks like there's still some weirdness with the interop stuff, so once that gets fixed we should be good and I'll push the new code.

@guregu
Collaborator

guregu commented Apr 21, 2023

I believe this is what was causing the interop memory issue: trealla-prolog/trealla#162
Once we fix that everything should be solid.

@guregu changed the title from "wasmtime vs wasmer" to "Memory leaks (wasmtime vs wasmer)" on Apr 21, 2023
@guregu
Collaborator

guregu commented Apr 23, 2023

More good news: we've fixed all the leaks. Just need to tweak the interop stuff to work better with wasmtime and I can make a new release.

@enoperm

enoperm commented Apr 23, 2023

I have plugged the wasmtime branch into the toy project where I initially detected the memory leak, and I can confirm that it can now serve each request on a new Trealla instance just fine, without any apparent leaks. ✔️

@guregu
Collaborator

guregu commented Apr 23, 2023

Great! I have some more changes I'm just about to push. Will test it a bit more and make a new release shortly.

@guregu closed this as completed in #7 on Apr 23, 2023
@guregu reopened this on Apr 23, 2023
@guregu
Collaborator

guregu commented Apr 23, 2023

Hmm, seems like wasmtime can crash sometimes. Probably my fault, investigating.

@guregu
Collaborator

guregu commented Apr 23, 2023

Whoops, it was a concurrency issue; fixed now.
Made an issue for wasm64 support as well: #8

@guregu closed this as completed on Apr 23, 2023