Replace stdlib CSV reader with simpler detector #553

Draft · wants to merge 1 commit into master
Conversation

@wagoodman commented Jul 10, 2024

Using the stdlib csv reader can be resource intensive from a memory perspective. We're seeing evidence of this in stereoscope:

[Screenshot 2024-07-10 at 12:28 PM: memory profile before the change]

Since we don't need the full CSV reader functionality, this PR drops usage of the CSV reader and adds a CSV detector in its place. This yields a drastic memory improvement (not in in-use memory, but in total allocated memory):

[Screenshot 2024-07-10 at 12:29 PM: allocation profile after the change]
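The core idea, as a rough sketch (illustrative only; the names and heuristics here are not the exact code in this PR): scan a bounded prefix line by line and check that the delimiter-separated field counts stay consistent, without ever materializing parsed records:

package csv

import "bytes"

// looksLikeSv is a simplified illustration of a separated-values detector:
// the input matches if at least two non-empty lines share the same field
// count (> 1). Quoting is ignored entirely in this sketch.
func looksLikeSv(raw []byte, delimiter byte, limit uint32) bool {
	if limit > 0 && uint32(len(raw)) > limit {
		raw = raw[:limit]
	}
	want, rows := -1, 0
	for _, line := range bytes.Split(raw, []byte{'\n'}) {
		if len(bytes.TrimSpace(line)) == 0 {
			continue
		}
		got := bytes.Count(line, []byte{delimiter}) + 1
		if want == -1 {
			want = got
		} else if got != want {
			return false
		}
		rows++
	}
	return rows > 1 && want > 1
}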

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>
@gabriel-vasile (Owner) left a comment

Wow, thank you. I had this on my todo list for so long but was reluctant because CSV reading is a really hairy issue.

I have 2 comments about it:

  1. Please move the code into a new package called csv in this directory.
  2. The new detector should behave like the stdlib detector. I changed the tests and also added a fuzzing test against the stdlib reader in 8d272ec; please merge the fuzzing test into this PR (a sketch of such a test follows below).

Edit: v1.4.4 and the next release include performance improvements, mostly related to memory allocs. It might help, but yeah... my profiling also shows that CSV and NDJSON allocate a lot.
Edit: a benchmark between sv and svStdlib to show the improvement would be great.
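A differential fuzz test along those lines might look like the following sketch. Here svStdlib is a hypothetical reconstruction of the old stdlib-backed check and Detect stands in for the PR's new detector; the real test lives in 8d272ec:

package magic

import (
	"bytes"
	"encoding/csv"
	"io"
	"testing"
)

// svStdlib mimics the old stdlib-backed check: the input is CSV if the
// reader yields more than one record without error (encoding/csv already
// enforces a consistent field count by default).
func svStdlib(raw []byte, delimiter rune) bool {
	r := csv.NewReader(bytes.NewReader(raw))
	r.Comma = delimiter
	r.LazyQuotes = true
	records := 0
	for {
		_, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			return false
		}
		records++
	}
	return records > 1
}

// FuzzCsvDetector cross-checks the new detector against the stdlib reader.
func FuzzCsvDetector(f *testing.F) {
	f.Add([]byte("a,b,c\n1,2,3\n"))
	f.Fuzz(func(t *testing.T, raw []byte) {
		want := svStdlib(raw, ',')
		if got := Detect(raw, ',', 0); got != want {
			t.Errorf("mismatch for %q: Detect=%v stdlib=%v", raw, got, want)
		}
	})
}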

@wagoodman (Author) commented Jul 15, 2024

I'll push the refactors shortly. I wrote a benchmark test locally that I won't push as-is, since the sv code will be gone (per your other review comment, I'll make certain a benchmark test is pushed).

Benchmark test code
package magic

import (
	"fmt"
	"io"
	"os"
	"testing"
)

func BenchmarkDetectVsSv(b *testing.B) {
	fh, err := os.Open("random_data.csv")
	if err != nil {
		b.Fatalf("failed to open file: %+v", err)
	}
	defer fh.Close()

	contents, err := io.ReadAll(fh)
	if err != nil {
		b.Fatalf("failed to read file: %+v", err)
	}

	for _, limit := range []uint32{0, 100, 1000} {
		b.Run(fmt.Sprintf("sv(limit=%d)", limit), func(b *testing.B) {
			// run the body b.N times so ns/op reflects per-call cost
			for i := 0; i < b.N; i++ {
				sv(contents, ',', limit)
			}
		})

		b.Run(fmt.Sprintf("Detect(limit=%d)", limit), func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				Detect(contents, ',', limit)
			}
		})
	}
}
BenchmarkDetectVsSv
BenchmarkDetectVsSv/sv(limit=0)
BenchmarkDetectVsSv/sv(limit=0)-12         	1000000000	         0.001642 ns/op
BenchmarkDetectVsSv/Detect(limit=0)
BenchmarkDetectVsSv/Detect(limit=0)-12     	1000000000	         0.0000238 ns/op
BenchmarkDetectVsSv/sv(limit=100)
BenchmarkDetectVsSv/sv(limit=100)-12       	1000000000	         0.001492 ns/op
BenchmarkDetectVsSv/Detect(limit=100)
BenchmarkDetectVsSv/Detect(limit=100)-12   	1000000000	         0.0000017 ns/op
BenchmarkDetectVsSv/sv(limit=1000)
BenchmarkDetectVsSv/sv(limit=1000)-12      	1000000000	         0.001519 ns/op
BenchmarkDetectVsSv/Detect(limit=1000)
BenchmarkDetectVsSv/Detect(limit=1000)-12  	1000000000	         0.0000065 ns/op
PASS

So Detect() is winning out in terms of CPU time too (🎉 🌮 ).

The file under test was (a sketch of a generator for a file like this follows the list):

  • 421K in size
  • 40 columns per row
  • 1500 rows
  • a mix of number and string field types
  • strings with a mix of single- and double-quote usage
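For reference, a file of roughly that shape could be produced with something like this hypothetical generator (not the actual script used for the benchmark):

package main

import (
	"fmt"
	"math/rand"
	"os"
)

// Generates ~1500 rows x 40 columns of mixed number/string fields,
// with mixed quote styles in the string fields.
func main() {
	f, err := os.Create("random_data.csv")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	strs := []string{`plain text`, `'single quoted'`, `"double quoted"`}
	for row := 0; row < 1500; row++ {
		for col := 0; col < 40; col++ {
			if col > 0 {
				fmt.Fprint(f, ",")
			}
			if rand.Intn(2) == 0 {
				fmt.Fprint(f, rand.Intn(100000)) // number field
			} else {
				fmt.Fprint(f, strs[rand.Intn(len(strs))]) // string field
			}
		}
		fmt.Fprintln(f)
	}
}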

Edit: please note that LazyQuotes is not implemented in this PR; my first attempt to incorporate that feature led to a lot of pain and a performance loss.

@wagoodman marked this pull request as draft July 15, 2024 21:49
@wagoodman (Author) commented

After running through the fuzz test, I've found a lot of edge cases. It essentially comes down to the LazyQuotes setting on the original CSV reader. I'll see what I can do, but in the meantime I've converted this to a draft to signal it might be a little while before it's ready for review...

@wagoodman (Author) commented Jul 17, 2024

@gabriel-vasile I've done some initial work to get the prototype working with the fuzz tests, but honestly it's not worth the gains (a smaller memory improvement; worse CPU performance [edit: much better now, still better than std... I fixed an early-return issue]; and the code is rather verbose). The current state of the prototype is here: anchore#2 ... quite a bit larger than this current PR.

I might shelve this for a while, but will noodle on it in the background. One thought I had: what if the "simple" (more performant) detector that does not support LazyQuotes were offered as a configuration option? I also haven't characterized just switching the LazyQuotes flag off when creating the stdlib reader.
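For context, LazyQuotes on the stdlib reader tolerates bare quotes inside non-quoted fields; a standalone illustration of the difference (not code from the PR):

package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

func main() {
	// A bare double quote inside a non-quoted field is only accepted
	// when LazyQuotes is enabled.
	input := `a,say "hi" there,c` + "\n"

	strict := csv.NewReader(strings.NewReader(input))
	_, err := strict.Read()
	fmt.Println("strict:", err) // bare " in non-quoted-field

	lazy := csv.NewReader(strings.NewReader(input))
	lazy.LazyQuotes = true
	rec, err := lazy.Read()
	fmt.Println("lazy:", rec, err) // [a say "hi" there c] <nil>
}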

gabriel-vasile added a commit that referenced this pull request Aug 12, 2024


When iterating over multiple files, csv detector allocated a new buffer
for each file. This change adds a pool of buffers that can be reused
between detections. The same pool is shared between csv and tsv
detectors.
@gabriel-vasile (Owner) commented Aug 12, 2024

@wagoodman I took inspiration from the json package and added a pool of bufio.Readers for CSV and TSV detection.

The code is here: 5f825db
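The pattern is roughly the following sketch (detectSv and its field-counting body are illustrative; see 5f825db for the actual change):

package magic

import (
	"bufio"
	"bytes"
	"sync"
)

// A shared pool lets the csv and tsv detectors reuse bufio.Readers across
// calls instead of allocating a fresh one per file.
var readerPool = sync.Pool{
	New: func() any { return bufio.NewReader(nil) },
}

// detectSv shows the get/reset/put pattern around a line-by-line scan.
func detectSv(raw []byte, delimiter byte) bool {
	br := readerPool.Get().(*bufio.Reader)
	defer readerPool.Put(br)
	br.Reset(bytes.NewReader(raw))

	want, rows := -1, 0
	for {
		line, err := br.ReadBytes('\n')
		if len(bytes.TrimSpace(line)) > 0 {
			got := bytes.Count(line, []byte{delimiter}) + 1
			if want == -1 {
				want = got
			} else if got != want {
				return false
			}
			rows++
		}
		if err != nil { // io.EOF ends the scan
			break
		}
	}
	return rows > 1 && want > 1
}

Sharing one pool between both detectors works because a pooled reader carries no per-delimiter state once it is Reset.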

The benchmark shows -77% allocated bytes:

➜  magic git:(csvpprof) ✗ benchstat before after
goos: linux
goarch: amd64
pkg: github.com/gabriel-vasile/mimetype/internal/magic
cpu: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
      │   before    │               after                │
      │   sec/op    │   sec/op     vs base               │
Csv-8   3.052µ ± 3%   1.695µ ± 3%  -44.46% (p=0.001 n=7)

      │    before    │                after                │
      │     B/op     │     B/op      vs base               │
Csv-8   5.258Ki ± 0%   1.164Ki ± 0%  -77.86% (p=0.001 n=7)

      │   before   │               after               │
      │ allocs/op  │ allocs/op   vs base               │
Csv-8   13.00 ± 0%   11.00 ± 0%  -15.38% (p=0.001 n=7)

It's not ideal; there are still some allocations, but it cuts the bulk of allocated bytes. Let me know if you see improvements in stereoscope, but this commit is probably going to master regardless.

A CSV detector that avoids allocs completely would be more than welcome, but that's a lot of work.

gabriel-vasile added a commit that referenced this pull request Aug 29, 2024