Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pkg/stanza] - Use trie for previous poll readers #29106

Closed

Conversation

VihasMakwana
Copy link
Contributor

@VihasMakwana VihasMakwana commented Nov 10, 2023

Description: This is my attempt to use Trie to store readers from previous poll cycles.

@VihasMakwana VihasMakwana changed the title [WIP][pkg/stanza] - Use trie for previous poll readers [pkg/stanza] - Use trie for previous poll readers Nov 14, 2023
@VihasMakwana VihasMakwana marked this pull request as ready for review November 14, 2023 06:51
@VihasMakwana VihasMakwana requested a review from a team November 14, 2023 06:51
@VihasMakwana
Copy link
Contributor Author

@djaglowski I was thinking, can we constraint our trie to accept only comparable types? This way it would lead to read-friendly code.

@swiatekm
Copy link
Contributor

What is the actual benefit of this? Are we looking for performance improvements in looking for existing readers? Intuitively, the fastest way of doing this would be to drop the bytes completely and just store a hash and the length of the file prefix it was calculated from. Then we'd also skip having to base64 encode all of these fingerprints to be able to store them in JSON, which I suspect costs more CPU time than the actual matching.

@VihasMakwana
Copy link
Contributor Author

VihasMakwana commented Nov 14, 2023

@swiatekm-sumo how would we determine the length of file prefix when we re-discover a file in future poll cycles? We do have hashes stored in history, but how would we determine a "match"?

I'm trying to understand following scenario:
Let's say we have,
file1: "hello1"
hash1:"abcdxyz", length of prefix: 6

we then store the hash and prefix in our pretty little array and move to next poll cycle.

file1 becomes: "hello1hello2"
updated hash would be completely different.

in this poll cycle, how would we establish that we have already seen the file?
I can think of only one way i.e. loop through the array, calculate hash till the previous length and see if there's any match, right? Am I understanding it correctly?

@VihasMakwana
Copy link
Contributor Author

@djaglowski I will add benchmark comparisons in PR descriptions.

@swiatekm
Copy link
Contributor

swiatekm commented Nov 14, 2023

I can think of only one way i.e. loop through the array, calculate hash till the previous length and see if there's any match, right? Am I understanding it correctly?

That's the basic idea, yeah. You have a set of old readers from the previous cycle, and those readers have fingerprints with lengths {x, y, z}, ordered by size. So you calculate fingerprints for your new readers up to x, y and z lengths respectively, and compare at each level. This may seem wasteful, but I think it'd be more performant in practice:

  1. Hashes are calculated iteratively byte-by-byte anyway, so you don't incur any cost for stopping at a particular length.
  2. Hashes are actually just int64s, so comparisons are very fast.
  3. In the vast majority of cases, the set of lengths will be very small. It's very rare to have a lot of files smaller than the fingerprint size.

Admittedly, I haven't tested this, but I wanted to point out it's an option if we're going down the path of adding a trie just to be able to compare fingeprints more efficiently. Storing the whole fingerprint is an awkward solution to begin with in my opinion, but its primary value is that it's very simple. If we're willing to make it more complex, then we should consider alternatives.

@VihasMakwana
Copy link
Contributor Author

you know what, now that you've pointed this idea I might give it a try and compare results. Thanks for pointing this out.
Alternatively, I can close this PR if you're already planning to work on this. Just let me know!

@swiatekm
Copy link
Contributor

I hadn't started anything, and won't this week, so feel free to give it a shot! I'm cool with this PR staying, though we should probably move this discussion to an issue.

Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Nov 29, 2023
Copy link
Contributor

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Dec 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants