
[pkg/stanza] - Performance improvements while comparing fingerprints in fileconsumer #29273

Closed
VihasMakwana opened this issue Nov 14, 2023 · 10 comments


@VihasMakwana
Contributor

VihasMakwana commented Nov 14, 2023

Intuitively, the fastest way of doing this would be to drop the bytes completely and just store a hash and the length of the file prefix it was calculated from. Then we'd also skip having to base64 encode all of these fingerprints to be able to store them in JSON, which I suspect costs more CPU time than the actual matching.

Originally posted by @swiatekm-sumo in #29106 (comment)

I'm converting the above discussion to a new issue—more in the comments.
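
To make the original idea concrete, here is a minimal sketch (not existing pkg/stanza code; the type and field names are hypothetical) of persisting a 64-bit hash plus the prefix length instead of the raw fingerprint bytes, which avoids base64-encoding anything into the JSON checkpoint:

package main

import (
    "encoding/json"
    "fmt"
    "hash/fnv"
)

// hashedFingerprint is a hypothetical replacement for storing raw fingerprint
// bytes: only the 64-bit hash and the prefix length it was computed from are
// persisted, so no base64 blob is needed in the JSON checkpoint.
type hashedFingerprint struct {
    Hash uint64 `json:"hash"` // FNV-64a of the first Len bytes of the file
    Len  int    `json:"len"`  // number of bytes that were hashed
}

func newHashedFingerprint(prefix []byte) hashedFingerprint {
    h := fnv.New64a()
    h.Write(prefix) // Write on an fnv hash never returns an error
    return hashedFingerprint{Hash: h.Sum64(), Len: len(prefix)}
}

func main() {
    fp := newHashedFingerprint([]byte("2023-11-14 first log line\n"))
    out, _ := json.Marshal(fp)
    fmt.Println(string(out)) // prints something like {"hash":...,"len":26}
}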


Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@VihasMakwana
Contributor Author

Summarizing @swiatekm-sumo's proposal:

Comparing and storing hashes of the fingerprints seems to be more efficient than storing individual fingerprint bytes. Storing the whole fingerprint is an awkward solution to begin with in my opinion, but its primary value is that it's very simple.
You have a set of old readers from the previous cycle, and those readers have fingerprints with lengths {x, y, z}, ordered by size. So you calculate fingerprints for your new readers up to x, y, and z lengths respectively, and compare at each level. This may seem wasteful, but I think it'd be more performant in practice (see the sketch after the list below):

  • Hashes are calculated iteratively byte-by-byte anyway, so you don't incur any cost for stopping at a particular length.
  • Hashes are just int64s, so comparisons are very fast.
  • In the vast majority of cases, the set of lengths will be very small. It's very rare to have a lot of files smaller than the fingerprint size.
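
A rough sketch of the multi-length comparison described above, again assuming FNV-64a (the helper name is hypothetical): because the hash is extended incrementally, taking an intermediate sum at each known fingerprint length costs nothing extra.

// prefixHashes returns the hash of the new file's prefix at each of the given
// lengths (the {x, y, z} from old readers). lengths must be sorted ascending.
// Uses hash/fnv, as in the sketch near the top of this issue.
func prefixHashes(prefix []byte, lengths []int) map[int]uint64 {
    h := fnv.New64a()
    out := make(map[int]uint64, len(lengths))
    written := 0
    for _, n := range lengths {
        if n > len(prefix) {
            break // the new file is shorter than this fingerprint length
        }
        h.Write(prefix[written:n]) // extend the running hash, no rework
        written = n
        out[n] = h.Sum64() // intermediate sum at this length
    }
    return out
}

Matching a new file against the old readers then reduces to integer lookups of these sums against the stored hashes for each length.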

Using the above solution + trie wouldn't make much sense, as we'd need to run multiple checks on the trie with different hashes.
IMO, only one of these solutions can exist at a time, i.e. either trie + fingerprint or storing hashes of fingerprints in an array/map.
@djaglowski what are your thoughts?

@crobert-1 crobert-1 added the enhancement New feature or request label Nov 14, 2023
@djaglowski
Member

djaglowski commented Nov 14, 2023

I'm happy to see a competition of ideas here. The current tracking mechanism cannot possibly be the best, so I'm open to anything which improves upon the situation.

I think the right way to evaluate this is to clearly define the requirements. Then we only need to validate correctness and compare performance.

Here's my best attempt. This incorporates the notion of "singleton files", which I believe is critical to properly hardening this package. This was loosely introduced in #27823 but has not yet been strictly enforced. Please call out if I've missed something.

  1. Our identification mechanism for a file must be based solely on its contents. We may assume that files are append-only, but must be able to recognize a file which has had data appended to it.
  2. At a high level, we need to be able to track a global set of files where no two items in the set represent the same file.
  3. The global set of files must be subdivided into discrete subsets of files, where each subset represents a stage in our lifecycle of reading and remembering the file. (Roughly, "actively reading", "done reading but still open", "closed, but not forgotten", "closed, but not forgotten 2", etc) A file must not exist in two subsets simultaneously. (Though it may be removed from one and then immediately added to another.)
  4. We must be able to add a file to a specific subset.
  5. We must be able to retrieve a file from a specific subset, if present, knowing that if retrieved it is also removed from both the subset and the overall set. (e.g. a Get which returns nil if not found; otherwise it removes and returns the item)
  6. We must be able to retrieve all values from a specific subset, effectively emptying it.
  7. The identifying type for a file should be consistent across all subsets. (e.g. always fingerprint.Fingerprint or an equivalent)
  8. The value of a file in a subset must be generic enough to allow either *reader.Metadata (for closed files) or *reader.Reader (for open files).

@djaglowski
Member

To illustrate what I mean by the above, with a simplified management struct, I think we should end up with something like the following:

type Tracker struct {
    maxConcurrentFiles int
    readingFiles       FileSet
    openFiles          FileSet
    closedFiles        []FileSet
}

// When opening a new file to determine its contents and whether to read it
func (t *Tracker) readFile(path string) {

    // respect max_concurrent_files
    if t.readingFiles.Len()+t.openFiles.Len() >= t.maxConcurrentFiles {
        t.closeOne() // maybe need to also track order in which files were added to set
    }

    id, handle := readID(path)
    if id == nil { // don't bother with empty files
        handle.Close()
        return
    }

    // first check if we're already reading it
    if f := t.readingFiles.Get(id); f != nil {
        handle.Close()
        t.readingFiles.Add(id, f) // re-add, since it was removed by Get
        return
    }

    // next best: a file that's still open
    if f := t.openFiles.Get(id); f != nil {
        handle.Close()
        t.readingFiles.Add(id, f)
        return
    }

    // if we remember it, we can pick up from a checkpoint
    for i := 0; i < len(t.closedFiles); i++ {
        if md := t.closedFiles[i].Get(id); md != nil {
            f := reader.NewFromMetadata(md, handle)
            t.readingFiles.Add(id, f)
            return
        }
    }

    // No memory of this file
    f := reader.New(id, handle)
    t.readingFiles.Add(id, f)
}

// When we've read to the end of a file
func (t *Tracker) doneReading(id FileID) {
    f := t.readingFiles.Get(id)
    t.openFiles.Add(id, f)
}

// When we need to close a file
func (t *Tracker) closeFile(id FileID) {
    f := t.openFiles.Get(id)
    md := f.Close()
    t.closedFiles[0].Add(id, md)
}

// When we reach the end of a poll cycle
func (t *Tracker) forgetOldest() {
    // drop oldest, shift each set, instantiate new at [0]
}
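
The sketch above assumes a FileSet type with "Get also removes" semantics. Purely as an illustration (none of these names exist in the package), a minimal map-backed version covering requirements 4-6 and 8 could look like:

// FileID stands in for whatever identifying type we settle on
// (raw fingerprint bytes, or a hash of them).
type FileID string

// FileSet is a minimal, illustrative set keyed by FileID. Values are left as
// `any` so a set can hold *reader.Reader (open files) or *reader.Metadata
// (closed files), per requirement 8 above.
type FileSet struct {
    items map[FileID]any
}

func NewFileSet() FileSet { return FileSet{items: map[FileID]any{}} }

// Add inserts or overwrites the item under its ID (requirement 4).
func (s FileSet) Add(id FileID, v any) { s.items[id] = v }

// Get removes and returns the item, or nil if absent; retrieval implies
// removal from the set (requirement 5).
func (s FileSet) Get(id FileID) any {
    v, ok := s.items[id]
    if !ok {
        return nil
    }
    delete(s.items, id)
    return v
}

// PopAll empties the set and returns everything in it (requirement 6).
func (s FileSet) PopAll() []any {
    out := make([]any, 0, len(s.items))
    for id, v := range s.items {
        out = append(out, v)
        delete(s.items, id)
    }
    return out
}

func (s FileSet) Len() int { return len(s.items) }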

@VihasMakwana
Contributor Author

I like the simplicity here. Thanks for clarifying!
The requirements make a lot of sense, and I was thinking of comparing the two approaches mentioned above. Does that sound good to you?

@VihasMakwana
Contributor Author

@djaglowski I compared both approaches.

  1. Trie
  • A trie is only useful in a read-heavy system, where writes are less frequent than reads.
  • In our case, the read-to-write ratio is approximately 1:1, as we update the trie after every poll and read from it during the poll.
  • The time complexity of writing all files into the trie is O(fingerprint_size * max_concurrent_files).
  • For a lookup in the trie, we spend O(fingerprint_size) time on average.
  • We also need to consider the overhead of individual operations (e.g. setting values, comparing nodes).
  2. Current approach
  • We just append all the readers, so writing is O(max_concurrent_files).
  • For comparing, the worst case is O(fingerprint_size * max_concurrent_files).
    • Bear in mind that in most cases we stop early, as soon as we find a match or encounter a mismatching byte.
    • So comparison can be fairly quick.

The results are almost identical in normal cases. The trie sometimes falls short because its write complexity is always O(fingerprint_size * max_concurrent_files), whereas the current approach has the advantage of stopping early, so it sometimes outperforms the trie. The trie outperforms the current approach in extreme cases, for example when fingerprints are small and the mismatch is at the last byte. For average cases, the current approach seems to do well.
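
For illustration, the early exit mentioned above boils down to a byte-wise prefix check like the following (a simplified stand-in, not the actual fileconsumer implementation); a mismatch between unrelated files is usually detected within the first few bytes, whereas the trie pays its full write cost on every poll:

// startsWith reports whether newPrefix begins with oldFP. This is the kind of
// comparison the "current approach" performs against each known fingerprint.
func startsWith(newPrefix, oldFP []byte) bool {
    if len(oldFP) > len(newPrefix) {
        return false // old fingerprint is longer than the new file's prefix
    }
    for i := range oldFP {
        if newPrefix[i] != oldFP[i] {
            return false // early exit, usually within the first few bytes
        }
    }
    return true // oldFP is a prefix of newPrefix: same file, possibly appended to
}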

@djaglowski
Member

Thanks for the analysis @VihasMakwana

djaglowski pushed a commit that referenced this issue Jan 18, 2024
**Description:** Following up from
#30219.
Adding a new package for fileset.

**Link to tracking Issue:**
[29273](#29273 (comment))

@djaglowski djaglowski added the priority:p2 Medium label Jan 23, 2024

Pinging code owners for receiver/filelog: @djaglowski. See Adding Labels via Comments if you do not have permissions to add labels yourself.


This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Apr 17, 2024
@djaglowski
Member

Closing based on #31317 (comment).

If anyone else wants to make a PR with another implementation, please feel free to ping me and I'll reopen the issue. Otherwise, this is unplanned.
