
[pkg/stanza] - Performance improvements while comparing fingerprints in fileconsumer #29273

Closed
VihasMakwana opened this issue Nov 14, 2023 · 10 comments


@VihasMakwana
Contributor

VihasMakwana commented Nov 14, 2023

Intuitively, the fastest way of doing this would be to drop the bytes completely and just store a hash and the length of the file prefix it was calculated from. Then we'd also skip having to base64 encode all of these fingerprints to be able to store them in JSON, which I suspect costs more CPU time than the actual matching.

Originally posted by @swiatekm-sumo in #29106 (comment)

I'm converting the above discussion to a new issue—more in the comments.
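
To make the original idea concrete, here is a minimal sketch (not existing pkg/stanza code; the type and field names are hypothetical) of persisting a 64-bit hash plus the prefix length instead of the raw fingerprint bytes, which avoids base64-encoding anything into the JSON checkpoint:

package main

import (
    "encoding/json"
    "fmt"
    "hash/fnv"
)

// hashedFingerprint is a hypothetical replacement for storing raw fingerprint
// bytes: only the 64-bit hash and the prefix length it was computed from are
// persisted, so no base64 blob is needed in the JSON checkpoint.
type hashedFingerprint struct {
    Hash uint64 `json:"hash"` // FNV-64a of the first Len bytes of the file
    Len  int    `json:"len"`  // number of bytes that were hashed
}

func newHashedFingerprint(prefix []byte) hashedFingerprint {
    h := fnv.New64a()
    h.Write(prefix) // Write on an fnv hash never returns an error
    return hashedFingerprint{Hash: h.Sum64(), Len: len(prefix)}
}

func main() {
    fp := newHashedFingerprint([]byte("2023-11-14 first log line\n"))
    out, _ := json.Marshal(fp)
    fmt.Println(string(out)) // prints something like {"hash":...,"len":26}
}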


Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@VihasMakwana
Contributor Author

Summarizing @swiatekm-sumo's proposal:

Comparing and storing hashes of the fingerprints seems to be more efficient than storing individual fingerprint bytes. Storing the whole fingerprint is an awkward solution to begin with in my opinion, but its primary value is that it's very simple.
You have a set of old readers from the previous cycle, and those readers have fingerprints with lengths {x, y, z}, ordered by size. So you calculate fingerprints for your new readers up to x, y, and z lengths respectively, and compare at each level. This may seem wasteful, but I think it'd be more performant in practice (see the sketch after the list below):

  • Hashes are calculated iteratively byte-by-byte anyway, so you don't incur any cost for stopping at a particular length.
  • Hashes are just int64s, so comparisons are very fast.
  • In the vast majority of cases, the set of lengths will be very small. It's very rare to have a lot of files smaller than the fingerprint size.
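
A rough sketch of the multi-length comparison described above, again assuming FNV-64a (the helper name is hypothetical): because the hash is extended incrementally, taking an intermediate sum at each known fingerprint length costs nothing extra.

// prefixHashes returns the hash of the new file's prefix at each of the given
// lengths (the {x, y, z} from old readers). lengths must be sorted ascending.
// Uses hash/fnv, as in the sketch near the top of this issue.
func prefixHashes(prefix []byte, lengths []int) map[int]uint64 {
    h := fnv.New64a()
    out := make(map[int]uint64, len(lengths))
    written := 0
    for _, n := range lengths {
        if n > len(prefix) {
            break // the new file is shorter than this fingerprint length
        }
        h.Write(prefix[written:n]) // extend the running hash, no rework
        written = n
        out[n] = h.Sum64() // intermediate sum at this length
    }
    return out
}

Matching a new file against the old readers then reduces to integer lookups of these sums against the stored hashes for each length.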

Using the above solution + trie wouldn't make much sense, as we'd need to run multiple checks on the trie with different hashes.
IMO, only one of these solutions can exist at a time, i.e. either trie + fingerprint or storing hashes of fingerprints in an array/map.
@djaglowski what are your thoughts?

@crobert-1 crobert-1 added the enhancement New feature or request label Nov 14, 2023
@djaglowski
Member

djaglowski commented Nov 14, 2023

I'm happy to see a competition of ideas here. The current tracking mechanism cannot possibly be the best, so I'm open to anything which improves upon the situation.

I think the right way to evaluate this is to clearly define the requirements. Then we only need to validate correctness and compare performance.

Here's my best attempt. This incorporates the notion of "singleton files", which I believe is critical to properly hardening this package. This was loosely introduced in #27823 but has not yet been strictly enforced. Please call out if I've missed something.

  1. Our identification mechanism for a file must be based solely on its contents. We may assume that files are append-only, but must be able to recognize a file which has had data appended to it.
  2. At a high level, we need to be able to track a global set of files where no two items in the set represent the same file.
  3. The global set of files must be subdivided into discrete subsets of files, where each subset represents a stage in our lifecycle of reading and remembering the file. (Roughly, "actively reading", "done reading but still open", "closed, but not forgotten", "closed, but not forgotten 2", etc) A file must not exist in two subsets simultaneously. (Though it may be removed from one and then immediately added to another.)
  4. We must be able to add a file to a specific subset.
  5. We must be able to retrieve a file from a specific subset, if present, knowing that if retrieved it is also removed from both the subset and the overall set. (e.g. a Get which returns nil if not found; otherwise it removes and returns the item)
  6. We must be able to retrieve all values from a specific subset, effectively emptying it.
  7. The identifying type for a file should be consistent across all subsets. (e.g. always fingerprint.Fingerprint or an equivalent)
  8. The value of a file in a subset must be generic enough to allow either *reader.Metadata (for closed files) or *reader.Reader (for open files).

@djaglowski
Member

To illustrate what I mean by the above, with a simplified management struct, I think we should end up with something like the following:

type Tracker struct {
    maxConcurrentFiles int
    readingFiles       FileSet
    openFiles          FileSet
    closedFiles        []FileSet
}

// When opening a new file to determine its contents and whether to read it
func (t *Tracker) readFile(path string) {

    // respect max_concurrent_files
    if t.readingFiles.Len()+t.openFiles.Len() >= t.maxConcurrentFiles {
        t.closeOne() // maybe need to also track order in which files were added to set
    }

    id, handle := readID(path)
    if id == nil { // don't bother with empty files
        handle.Close()
        return
    }

    // first check if we're already reading it
    if f := t.readingFiles.Get(id); f != nil {
        handle.Close()
        t.readingFiles.Add(id, f) // re-add, since it was removed by Get
        return
    }

    // next best: a file that's still open
    if f := t.openFiles.Get(id); f != nil {
        handle.Close()
        t.readingFiles.Add(id, f)
        return
    }

    // if we remember it, we can pick up from a checkpoint
    for i := 0; i < len(t.closedFiles); i++ {
        if md := t.closedFiles[i].Get(id); md != nil {
            f := reader.NewFromMetadata(md, handle)
            t.readingFiles.Add(id, f)
            return
        }
    }

    // No memory of this file
    f := reader.New(id, handle)
    t.readingFiles.Add(id, f)
}

// When we've read to the end of a file
func (t *Tracker) doneReading(id FileID) {
    f := t.readingFiles.Get(id)
    t.openFiles.Add(id, f)
}

// When we need to close a file
func (t *Tracker) closeFile(id FileID) {
    f := t.openFiles.Get(id)
    md := f.Close()
    t.closedFiles[0].Add(id, md)
}

// When we reach the end of a poll cycle
func (t *Tracker) forgetOldest() {
    // drop oldest, shift each set, instantiate new at [0]
}
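
The sketch above assumes a FileSet type with "Get also removes" semantics. Purely as an illustration (none of these names exist in the package), a minimal map-backed version covering requirements 4-6 and 8 could look like:

// FileID stands in for whatever identifying type we settle on
// (raw fingerprint bytes, or a hash of them).
type FileID string

// FileSet is a minimal, illustrative set keyed by FileID. Values are left as
// `any` so a set can hold *reader.Reader (open files) or *reader.Metadata
// (closed files), per requirement 8 above.
type FileSet struct {
    items map[FileID]any
}

func NewFileSet() FileSet { return FileSet{items: map[FileID]any{}} }

// Add inserts or overwrites the item under its ID (requirement 4).
func (s FileSet) Add(id FileID, v any) { s.items[id] = v }

// Get removes and returns the item, or nil if absent; retrieval implies
// removal from the set (requirement 5).
func (s FileSet) Get(id FileID) any {
    v, ok := s.items[id]
    if !ok {
        return nil
    }
    delete(s.items, id)
    return v
}

// PopAll empties the set and returns everything in it (requirement 6).
func (s FileSet) PopAll() []any {
    out := make([]any, 0, len(s.items))
    for id, v := range s.items {
        out = append(out, v)
        delete(s.items, id)
    }
    return out
}

func (s FileSet) Len() int { return len(s.items) }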

@VihasMakwana
Contributor Author

I like the simplicity here. Thanks for clarifying!
The requirements make a lot of sense, and I was thinking of comparing the two approaches mentioned above. Does that sound good to you?

@VihasMakwana
Contributor Author

@djaglowski I compared both approaches.

  1. Trie
  • A trie is only useful in a read-heavy system, where writes are less frequent than reads.
  • In our case, the read-to-write ratio is approximately 1:1, as we update the trie after every poll and read from it during the poll.
  • The time complexity of writing all files into the trie is O(fingerprint_size * max_concurrent_files).
  • For a lookup in the trie, we spend O(fingerprint_size) time on average.
  • We also need to consider the overhead of individual operations (e.g. setting values, comparing nodes).
  2. Current approach
  • We just append all the readers, so writing is O(max_concurrent_files).
  • For comparing, the worst case is O(fingerprint_size * max_concurrent_files).
    • Bear in mind that in most cases we stop early, as soon as we find a match or encounter a mismatching byte.
    • So comparison can be fairly quick.

The results are almost identical in normal cases. The trie sometimes falls short because its write complexity is always O(fingerprint_size * max_concurrent_files), whereas the current approach has the advantage of stopping early, so it sometimes outperforms the trie. The trie outperforms the current approach in extreme cases, for example when fingerprints are small and the mismatch is at the last byte. For average cases, the current approach seems to do well.
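
For illustration, the early exit mentioned above boils down to a byte-wise prefix check like the following (a simplified stand-in, not the actual fileconsumer implementation); a mismatch between unrelated files is usually detected within the first few bytes, whereas the trie pays its full write cost on every poll:

// startsWith reports whether newPrefix begins with oldFP. This is the kind of
// comparison the "current approach" performs against each known fingerprint.
func startsWith(newPrefix, oldFP []byte) bool {
    if len(oldFP) > len(newPrefix) {
        return false // old fingerprint is longer than the new file's prefix
    }
    for i := range oldFP {
        if newPrefix[i] != oldFP[i] {
            return false // early exit, usually within the first few bytes
        }
    }
    return true // oldFP is a prefix of newPrefix: same file, possibly appended to
}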

@djaglowski
Member

Thanks for the analysis @VihasMakwana

djaglowski pushed a commit that referenced this issue Jan 18, 2024
**Description:** Following up from
#30219.
Adding a new package for fileset.

**Link to tracking Issue:**
[29273](#29273 (comment))

@djaglowski djaglowski added the priority:p2 Medium label Jan 23, 2024

Pinging code owners for receiver/filelog: @djaglowski. See Adding Labels via Comments if you do not have permissions to add labels yourself.


This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Apr 17, 2024
@djaglowski
Member

Closing based on #31317 (comment).

If anyone else wants to make a PR with another implementation, please feel free to ping me and I'll reopen the issue. Otherwise, this is unplanned.
