Replace stdlib CSV reader with simpler detector #553

Draft · wants to merge 1 commit into master
Conversation

@wagoodman commented Jul 10, 2024

Using the stdlib csv reader can be resource intensive from a memory perspective. We're seeing evidence of this in stereoscope:

[Screenshot 2024-07-10 at 12:28 PM: memory profile before the change]

Since we don't need the full CSV reader functionality, this PR drops usage of the CSV reader and adds a CSV detector in its place. This yields a drastic memory improvement (not in in-use memory, but in total allocated memory):

[Screenshot 2024-07-10 at 12:29 PM: allocation profile after the change]
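The core idea, as a rough sketch (illustrative only; the names and heuristics here are not the exact code in this PR): scan a bounded prefix line by line and check that the delimiter-separated field counts stay consistent, without ever materializing parsed records:

package csv

import "bytes"

// looksLikeSv is a simplified illustration of a separated-values detector:
// the input matches if at least two non-empty lines share the same field
// count (> 1). Quoting is ignored entirely in this sketch.
func looksLikeSv(raw []byte, delimiter byte, limit uint32) bool {
	if limit > 0 && uint32(len(raw)) > limit {
		raw = raw[:limit]
	}
	want, rows := -1, 0
	for _, line := range bytes.Split(raw, []byte{'\n'}) {
		if len(bytes.TrimSpace(line)) == 0 {
			continue
		}
		got := bytes.Count(line, []byte{delimiter}) + 1
		if want == -1 {
			want = got
		} else if got != want {
			return false
		}
		rows++
	}
	return rows > 1 && want > 1
}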

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>
@gabriel-vasile (Owner) left a comment

Wow, thank you. I had this on my todo list for so long but was reluctant because CSV reading is a really hairy issue.

I have 2 comments about it:

  1. Please move the code into a new package called csv in this directory.
  2. The new detector should behave like the stdlib detector. I changed the tests and also added a fuzzing test against the stdlib reader in 8d272ec; please merge the fuzzing test into this PR (a sketch of such a test follows below).

Edit: v1.4.4 and the next release include performance improvements, mostly related to memory allocs. It might help, but yeah... my profiling also shows that CSV and NDJSON allocate a lot.
Edit: a benchmark between sv and svStdlib to show the improvement would be great.
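A differential fuzz test along those lines might look like the following sketch. Here svStdlib is a hypothetical reconstruction of the old stdlib-backed check and Detect stands in for the PR's new detector; the real test lives in 8d272ec:

package magic

import (
	"bytes"
	"encoding/csv"
	"io"
	"testing"
)

// svStdlib mimics the old stdlib-backed check: the input is CSV if the
// reader yields more than one record without error (encoding/csv already
// enforces a consistent field count by default).
func svStdlib(raw []byte, delimiter rune) bool {
	r := csv.NewReader(bytes.NewReader(raw))
	r.Comma = delimiter
	r.LazyQuotes = true
	records := 0
	for {
		_, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			return false
		}
		records++
	}
	return records > 1
}

// FuzzCsvDetector cross-checks the new detector against the stdlib reader.
func FuzzCsvDetector(f *testing.F) {
	f.Add([]byte("a,b,c\n1,2,3\n"))
	f.Fuzz(func(t *testing.T, raw []byte) {
		want := svStdlib(raw, ',')
		if got := Detect(raw, ',', 0); got != want {
			t.Errorf("mismatch for %q: Detect=%v stdlib=%v", raw, got, want)
		}
	})
}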

@wagoodman (Author) commented Jul 15, 2024

I'll push the refactors shortly. I wrote a benchmark test locally that I won't push as-is, since the sv code will be gone (per your other review comment, I'll make certain a benchmark test is pushed).

Benchmark test code
package magic

import (
	"fmt"
	"io"
	"os"
	"testing"
)

func BenchmarkDetectVsSv(b *testing.B) {
	fh, err := os.Open("random_data.csv")
	if err != nil {
		b.Fatalf("failed to open file: %+v", err)
	}
	defer fh.Close()

	contents, err := io.ReadAll(fh)
	if err != nil {
		b.Fatalf("failed to read file: %+v", err)
	}

	for _, limit := range []uint32{0, 100, 1000} {
		b.Run(fmt.Sprintf("sv(limit=%d)", limit), func(b *testing.B) {
			// run the body b.N times so ns/op reflects per-call cost
			for i := 0; i < b.N; i++ {
				sv(contents, ',', limit)
			}
		})

		b.Run(fmt.Sprintf("Detect(limit=%d)", limit), func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				Detect(contents, ',', limit)
			}
		})
	}
}
BenchmarkDetectVsSv
BenchmarkDetectVsSv/sv(limit=0)
BenchmarkDetectVsSv/sv(limit=0)-12         	1000000000	         0.001642 ns/op
BenchmarkDetectVsSv/Detect(limit=0)
BenchmarkDetectVsSv/Detect(limit=0)-12     	1000000000	         0.0000238 ns/op
BenchmarkDetectVsSv/sv(limit=100)
BenchmarkDetectVsSv/sv(limit=100)-12       	1000000000	         0.001492 ns/op
BenchmarkDetectVsSv/Detect(limit=100)
BenchmarkDetectVsSv/Detect(limit=100)-12   	1000000000	         0.0000017 ns/op
BenchmarkDetectVsSv/sv(limit=1000)
BenchmarkDetectVsSv/sv(limit=1000)-12      	1000000000	         0.001519 ns/op
BenchmarkDetectVsSv/Detect(limit=1000)
BenchmarkDetectVsSv/Detect(limit=1000)-12  	1000000000	         0.0000065 ns/op
PASS

So Detect() is winning out in terms of CPU time too (🎉 🌮 ).

The file under test was (a sketch of a generator for a file like this follows the list):

  • 421K in size
  • 40 columns per row
  • 1500 rows
  • a mix of number and string field types
  • strings with a mix of single- and double-quote usage
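For reference, a file of roughly that shape could be produced with something like this hypothetical generator (not the actual script used for the benchmark):

package main

import (
	"fmt"
	"math/rand"
	"os"
)

// Generates ~1500 rows x 40 columns of mixed number/string fields,
// with mixed quote styles in the string fields.
func main() {
	f, err := os.Create("random_data.csv")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	strs := []string{`plain text`, `'single quoted'`, `"double quoted"`}
	for row := 0; row < 1500; row++ {
		for col := 0; col < 40; col++ {
			if col > 0 {
				fmt.Fprint(f, ",")
			}
			if rand.Intn(2) == 0 {
				fmt.Fprint(f, rand.Intn(100000)) // number field
			} else {
				fmt.Fprint(f, strs[rand.Intn(len(strs))]) // string field
			}
		}
		fmt.Fprintln(f)
	}
}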

Edit: please note that LazyQuotes is not implemented in this PR; my first attempt to incorporate that feature led to a lot of pain and a performance loss.

@wagoodman marked this pull request as draft July 15, 2024 21:49
@wagoodman (Author) commented

After running through the fuzz test, I've found a lot of edge cases. It essentially comes down to the LazyQuotes setting on the original CSV reader. I'll see what I can do, but in the meantime I've converted this to a draft to signal it might be a little while before it's ready for review...

@wagoodman (Author) commented Jul 17, 2024

@gabriel-vasile I've done some initial work to get the prototype working with the fuzz tests, but honestly it's not worth the gains (a smaller memory improvement; worse CPU performance [edit: much better now, still better than std... I fixed an early-return issue]; and the code is rather verbose). The current state of the prototype is here: anchore#2 ... quite a bit larger than this current PR.

I might shelve this for a while, but will noodle on it in the background. One thought I had: what if the "simple" (more performant) detector that does not support LazyQuotes were offered as a configuration option? I also haven't characterized just switching the LazyQuotes flag off when creating the stdlib reader.
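For context, LazyQuotes on the stdlib reader tolerates bare quotes inside non-quoted fields; a standalone illustration of the difference (not code from the PR):

package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

func main() {
	// A bare double quote inside a non-quoted field is only accepted
	// when LazyQuotes is enabled.
	input := `a,say "hi" there,c` + "\n"

	strict := csv.NewReader(strings.NewReader(input))
	_, err := strict.Read()
	fmt.Println("strict:", err) // bare " in non-quoted-field

	lazy := csv.NewReader(strings.NewReader(input))
	lazy.LazyQuotes = true
	rec, err := lazy.Read()
	fmt.Println("lazy:", rec, err) // [a say "hi" there c] <nil>
}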

gabriel-vasile added a commit that referenced this pull request Aug 12, 2024


When iterating over multiple files, csv detector allocated a new buffer
for each file. This change adds a pool of buffers that can be reused
between detections. The same pool is shared between csv and tsv
detectors.
@gabriel-vasile (Owner) commented Aug 12, 2024

@wagoodman I took inspiration from the json package and added a pool of bufio.Readers for CSV and TSV detection.

The code is here: 5f825db
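The pattern is roughly the following sketch (detectSv and its field-counting body are illustrative; see 5f825db for the actual change):

package magic

import (
	"bufio"
	"bytes"
	"sync"
)

// A shared pool lets the csv and tsv detectors reuse bufio.Readers across
// calls instead of allocating a fresh one per file.
var readerPool = sync.Pool{
	New: func() any { return bufio.NewReader(nil) },
}

// detectSv shows the get/reset/put pattern around a line-by-line scan.
func detectSv(raw []byte, delimiter byte) bool {
	br := readerPool.Get().(*bufio.Reader)
	defer readerPool.Put(br)
	br.Reset(bytes.NewReader(raw))

	want, rows := -1, 0
	for {
		line, err := br.ReadBytes('\n')
		if len(bytes.TrimSpace(line)) > 0 {
			got := bytes.Count(line, []byte{delimiter}) + 1
			if want == -1 {
				want = got
			} else if got != want {
				return false
			}
			rows++
		}
		if err != nil { // io.EOF ends the scan
			break
		}
	}
	return rows > 1 && want > 1
}

Sharing one pool between both detectors works because a pooled reader carries no per-delimiter state once it is Reset.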

The benchmark shows -77% allocated bytes:

➜  magic git:(csvpprof) ✗ benchstat before after
goos: linux
goarch: amd64
pkg: github.com/gabriel-vasile/mimetype/internal/magic
cpu: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
      │   before    │               after                │
      │   sec/op    │   sec/op     vs base               │
Csv-8   3.052µ ± 3%   1.695µ ± 3%  -44.46% (p=0.001 n=7)

      │    before    │                after                │
      │     B/op     │     B/op      vs base               │
Csv-8   5.258Ki ± 0%   1.164Ki ± 0%  -77.86% (p=0.001 n=7)

      │   before   │               after               │
      │ allocs/op  │ allocs/op   vs base               │
Csv-8   13.00 ± 0%   11.00 ± 0%  -15.38% (p=0.001 n=7)

It's not ideal; there are still some allocations, but it cuts the bulk of allocated bytes. Let me know if you see improvements in stereoscope, but this commit is probably going to master regardless.

A CSV detector that avoids allocs completely would be more than welcome, but that's a lot of work.

gabriel-vasile added a commit that referenced this pull request Aug 29, 2024