Profile memory usage and describe plan of action to address issues #1936

Closed
timwu20 opened this issue Oct 28, 2021 · 8 comments

timwu20 commented Oct 28, 2021

Issue summary

  • Profile nodes on staging, the gossamer devnet, and locally to identify the root cause of the memory issues
  • After a certain block height the node becomes unresponsive
  • One contributing factor could be node.deepCopy
  • Outline a plan of action to address the memory issues

Other information and links

qdm12 commented Nov 2, 2021

Notes to myself (ongoing, this will be changed/edited):

  • Parentheses are in the format (storyPoints, priority[, 🥝]), where 🥝 means 'low-hanging fruit'

Program features

Infrastructure (after pprof server above)

  • (3,P2) Profile CPU and memory (see the pprof sketch after this list)
    • After a certain block height the node becomes unresponsive ➡️ is it a deadlock?
  • (5,P1) Set up profefe on the testnet
  • (1,P1) Do we have swap on our nodes? (doesn't matter)
  • (3,P1) Metrics setup with Prometheus and Grafana, maybe reusing the existing Substrate dashboard (same metrics)
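
As referenced above, a minimal sketch of exposing Go's built-in pprof HTTP endpoints via the standard library net/http/pprof package; the port and the wiring into main are illustrative, not gossamer's actual setup:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers on http.DefaultServeMux
)

func main() {
	// Serve the profiling endpoints on a local port; heap and CPU profiles
	// can then be fetched with e.g.:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the node's real run loop
}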

Code

  • Packages that might be responsible: dot/network, lib/trie, dot/sync

Memory related

  • (1,P3,🥝) Check usage of bytes.Buffer
  • (1,P3,🥝) Check usage of pools (see the sync.Pool sketch after this list), fix(trie): memory leak fix in lib/trie #2009
  • (2,P2) Check usage of copy
  • (5,P2) Check buffered channels (using regex make\(chan .+, [1-9][0-9]*\))
  • (2,P3,🥝) Check DeepCopy functions (case-insensitive search for deepcopy)
  • (2,P3,🥝) Check usage of sizedBufferPool - found a problem, fixed in fix(dot/network): fix memory allocations with sizedBufferPool #1963, although not a reason for OOM
  • (2,P3,🥝) Check for arrays returned (copy, it's not a slice), use regex \[[a-zA-Z0-9]+\](byte|int|uint|uint32|uint64|int32|int64|string)+?
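
As referenced in the pools item above, a minimal sketch of the sync.Pool pattern these checks are looking for, where every buffer taken from the pool is reset and put back; the package and function names are illustrative, not gossamer's actual identifiers:

package codec

import (
	"bytes"
	"sync"
)

var bufferPool = &sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

func encode(data []byte) []byte {
	buffer := bufferPool.Get().(*bytes.Buffer)
	buffer.Reset()
	// Putting the buffer back is what keeps the pool from leaking:
	// forgetting this means every call allocates a fresh buffer.
	defer bufferPool.Put(buffer)

	buffer.Write(data) // stand-in for the real encoding work

	// Copy the bytes out before returning, since the buffer is reused.
	out := make([]byte, buffer.Len())
	copy(out, buffer.Bytes())
	return out
}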

General soundness

arijitAD commented Nov 8, 2021

https://drive.google.com/drive/folders/1eh5_0Kb4zbwe8qckCqJ2we6EE3nTkr4C?usp=sharing
Investigate memory consumption in the Trie.

qdm12 commented Nov 17, 2021

With #2009 memory consumption (RES) grows until it stabilizes and oscillates between 9.2 and 9.8GB, after several hours (depends on CPU speed). I'll investigate whether there are obvious ways to reduce memory usage in the rest of the code in the coming days, according to the plan outlined above.

EDIT: Sometimes it stabilizes, sometimes it does not and the node goes OOM.
EDIT 2: It's back to bubbling up; the sync pool leak fix slowed down the memory consumption, but it's still growing.

qdm12 commented Jan 7, 2022

The problem is definitely coming from:

// Insert a new node in the current generation.
newNode := n.Copy()
newNode.SetGeneration(t.generation)
// The hash of the old node should already be computed,
// since it belongs to an older generation.
oldNodeHash := n.GetHash()
if len(oldNodeHash) > 0 {
	hash := common.BytesToHash(oldNodeHash)
	t.deletedKeys = append(t.deletedKeys, hash)
}
return newNode

We are appending hashes of old-generation nodes to the deletedKeys hash slice, but these nodes are actually still referenced by their parent branch (unless the node is the root).

One way to fix this would be to keep a reference to the parent branch (or nil for the root node), so that the parent's reference to the older-generation node can be updated.

I am still reviewing and testing the code in lib/trie/trie.go to see if this is possible and to find the best way to do so without breaking another feature.
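
For the record, a rough sketch of that idea; parent, childIndex, Children and root are assumed names rather than the actual lib/trie identifiers, and (per the later comments below) this approach was not what ended up being implemented:

// Same as the snippet above, but with a reference to the parent branch
// (nil for the root node) so its child pointer can be repointed at the
// new-generation copy, leaving the older-generation node unreferenced.
newNode := n.Copy()
newNode.SetGeneration(t.generation)

if oldNodeHash := n.GetHash(); len(oldNodeHash) > 0 {
	t.deletedKeys = append(t.deletedKeys, common.BytesToHash(oldNodeHash))
}

if parent == nil {
	// n was the root of the trie.
	t.root = newNode
} else {
	parent.Children[childIndex] = newNode
}
return newNode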

qdm12 commented Mar 8, 2022

After reaching full unit test coverage and completing a linear refactoring of trie.go, it turns out the comment above is actually wrong. There were, however, some misuses of the snapshotting mechanism in certain places of the code, which were causing some minor memory leaks.

Most notably, the following issues were the root of the memory usage increasing over time:

Issues fixed during the trie refactoring

  • Minor memory leak fixes due to incorrect logic (mostly regarding generations)

Issues fixed by refactoring the trie node code

  • sync.Pool memory leak due to buffers not being put back into the buffer pool

Issues fixed in dot/state

  • Fix prune keys being empty, to remove unfinalized tries once a trie is finalized
  • Fix sync.Map not allowing proper garbage collection when a key is deleted, for the map from root hash to trie (see the sketch after this list)
  • Fix removing the previous finalized trie when finalizing a trie
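
A minimal illustration of the root-hash-to-trie map cleanup referenced in the list above, using a plain map guarded by a mutex instead of sync.Map; the type, method names and import paths are assumptions based on the repository layout, not dot/state's actual API:

package state

import (
	"sync"

	"github.com/ChainSafe/gossamer/lib/common"
	"github.com/ChainSafe/gossamer/lib/trie"
)

// Tries maps state root hashes to in-memory tries. Entries must be
// deleted once a trie is pruned, otherwise the map keeps every trie
// reachable and the garbage collector can never free them.
type Tries struct {
	mutex sync.RWMutex
	tries map[common.Hash]*trie.Trie
}

func NewTries() *Tries {
	return &Tries{tries: make(map[common.Hash]*trie.Trie)}
}

func (t *Tries) Set(root common.Hash, tr *trie.Trie) {
	t.mutex.Lock()
	defer t.mutex.Unlock()
	t.tries[root] = tr
}

func (t *Tries) Get(root common.Hash) *trie.Trie {
	t.mutex.RLock()
	defer t.mutex.RUnlock()
	return t.tries[root]
}

// Delete removes the trie for the given root hash so it can be garbage collected.
func (t *Tries) Delete(root common.Hash) {
	t.mutex.Lock()
	defer t.mutex.Unlock()
	delete(t.tries, root)
}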

qdm12 commented Mar 17, 2022

There is a memory leak with the gossip map of seen hashes, discovered in #2383

qdm12 commented Mar 22, 2022

The memory leak in the gossip map of seen hashes mentioned above is tracked by #2398 (found in the audit).
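
For context, a grow-only map of seen gossip hashes leaks by construction; a hypothetical mitigation (not necessarily what #2398 implements) is to timestamp entries and prune old ones periodically:

package network

import (
	"sync"
	"time"

	"github.com/ChainSafe/gossamer/lib/common"
)

// seenTracker records message hashes with the time they were first seen,
// so stale entries can be dropped instead of accumulating forever.
type seenTracker struct {
	mutex sync.Mutex
	seen  map[common.Hash]time.Time
	ttl   time.Duration
}

func newSeenTracker(ttl time.Duration) *seenTracker {
	return &seenTracker{
		seen: make(map[common.Hash]time.Time),
		ttl:  ttl,
	}
}

// hasSeen reports whether the hash was already recorded, recording it if not.
func (s *seenTracker) hasSeen(hash common.Hash) bool {
	s.mutex.Lock()
	defer s.mutex.Unlock()
	if _, ok := s.seen[hash]; ok {
		return true
	}
	s.seen[hash] = time.Now()
	return false
}

// prune drops entries older than the TTL; meant to be called periodically.
func (s *seenTracker) prune() {
	s.mutex.Lock()
	defer s.mutex.Unlock()
	for hash, seenAt := range s.seen {
		if time.Since(seenAt) > s.ttl {
			delete(s.seen, hash)
		}
	}
}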

timwu20 commented Jul 9, 2024

Closing since we have resolved many of our slow memory leaks. We will also be revamping state storage, so another round of testing will be done.

timwu20 closed this as completed on Jul 9, 2024