walk: Use unbounded channels #1414

Closed · wants to merge 3 commits

Conversation

tavianator (Collaborator) commented on Oct 30, 2023

Includes #1413. The relevant commit is 5200718:

We originally switched to bounded channels for backpressure to fix #918.
However, bounded channels have a significant initialization overhead as
they pre-allocate a fixed-size buffer for the messages.

This implementation uses a different backpressure strategy: each thread
gets a limited-size pool of WorkerResults. When the size limit is hit,
the sender thread has to wait for the receiver thread to handle a result
from that pool and recycle it.

Inspired by snmalloc, results are recycled by sending the boxed result
over a channel back to the thread that allocated it. By allocating and
freeing each WorkerResult from the same thread, allocator contention is
reduced dramatically. And since we now pass results by pointer instead
of by value, message passing overhead is reduced as well.

Fixes #1408.
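To make the strategy concrete, here is a minimal sketch of the recycling scheme described above. The names (`ResultPool`, `WorkerMsg`, `POOL_SIZE`) and the placeholder `WorkerResult` are illustrative, not the exact code in the commit:

```rust
use crossbeam_channel::{unbounded, Receiver, Sender};

// Illustrative stand-in for fd's WorkerResult (~312 bytes in reality).
struct WorkerResult { /* path, metadata, ... */ }

const POOL_SIZE: usize = 0x4000; // per-thread limit (assumed value)

/// Each sender thread owns one of these. Boxes flow to the receiver
/// inside messages and come back on the `recycled` channel.
struct ResultPool {
    recycled: Receiver<Box<WorkerResult>>,
    recycle_tx: Sender<Box<WorkerResult>>, // cloned into each message
    allocated: usize,
}

struct WorkerMsg {
    result: Box<WorkerResult>,
    recycle: Sender<Box<WorkerResult>>,
}

impl ResultPool {
    fn new() -> Self {
        let (recycle_tx, recycled) = unbounded();
        Self { recycled, recycle_tx, allocated: 0 }
    }

    /// Box a result, reusing an allocation when possible. Once the pool
    /// limit is reached, this blocks until the receiver recycles a box;
    /// that wait is the backpressure.
    fn make_msg(&mut self, result: WorkerResult) -> WorkerMsg {
        let boxed = match self.recycled.try_recv() {
            Ok(mut b) => { *b = result; b }
            Err(_) if self.allocated < POOL_SIZE => {
                self.allocated += 1;
                Box::new(result)
            }
            Err(_) => {
                let mut b = self.recycled.recv().expect("receiver gone");
                *b = result;
                b
            }
        };
        WorkerMsg { result: boxed, recycle: self.recycle_tx.clone() }
    }
}

// Receiver side: handle the result, then send the box home for reuse.
fn handle(msg: WorkerMsg) {
    // ... print the path, etc. ...
    let _ = msg.recycle.send(msg.result);
}
```

Because each box is allocated and freed on the same thread, the allocator sees no cross-thread frees, which is where the contention win comes from.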

I benchmarked this with bfs's benchmark suite. Both fds were built against a version of ignore that included BurntSushi/ripgrep@d938e95, so we won't actually see performance quite this good until a new ignore release happens. Still, the results are good enough IMO that this fixes #1408 and #1362.

Benchmark results (updated)

Complete traversal

linux v6.5 (86,380 files)

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs bench/corpus/linux -false` | 20.2 ± 0.6 | 19.0 | 21.5 | 1.14 ± 0.11 |
| `find bench/corpus/linux -false` | 98.5 ± 0.2 | 98.1 | 98.9 | 5.58 ± 0.49 |
| `fd -u '^$' bench/corpus/linux` | 190.4 ± 47.0 | 133.0 | 231.3 | 10.78 ± 2.83 |
| `fd-master -u '^$' bench/corpus/linux` | 60.6 ± 15.7 | 29.5 | 70.5 | 3.43 ± 0.94 |
| `fd-unbounded -u '^$' bench/corpus/linux` | 17.7 ± 1.5 | 15.3 | 20.0 | 1.00 |

rust 1.72.1 (192,714 files)

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs bench/corpus/rust -false` | 52.2 ± 1.7 | 50.2 | 56.4 | 1.55 ± 0.08 |
| `find bench/corpus/rust -false` | 313.5 ± 1.2 | 311.6 | 315.6 | 9.29 ± 0.40 |
| `fd -u '^$' bench/corpus/rust` | 274.6 ± 37.6 | 256.6 | 352.3 | 8.14 ± 1.17 |
| `fd-master -u '^$' bench/corpus/rust` | 55.7 ± 16.7 | 45.2 | 86.0 | 1.65 ± 0.50 |
| `fd-unbounded -u '^$' bench/corpus/rust` | 33.8 ± 1.4 | 31.6 | 36.3 | 1.00 |

chromium 119.0.6036.2 (2,119,292 files)

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs bench/corpus/chromium -false` | 513.5 ± 11.4 | 490.6 | 528.1 | 2.06 ± 0.05 |
| `find bench/corpus/chromium -false` | 3285.0 ± 6.8 | 3275.2 | 3295.2 | 13.15 ± 0.14 |
| `fd -u '^$' bench/corpus/chromium` | 2538.8 ± 46.8 | 2476.3 | 2582.0 | 10.16 ± 0.22 |
| `fd-master -u '^$' bench/corpus/chromium` | 295.3 ± 17.6 | 264.9 | 307.4 | 1.18 ± 0.07 |
| `fd-unbounded -u '^$' bench/corpus/chromium` | 249.9 ± 2.6 | 246.5 | 254.8 | 1.00 |

Printing paths

Without colors

linux v6.5

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs bench/corpus/linux` | 32.3 ± 1.6 | 27.5 | 34.8 | 1.03 ± 0.09 |
| `find bench/corpus/linux` | 103.0 ± 0.4 | 102.4 | 103.7 | 3.27 ± 0.24 |
| `fd -u --search-path bench/corpus/linux` | 192.0 ± 47.9 | 133.2 | 230.4 | 6.09 ± 1.58 |
| `fd-master -u --search-path bench/corpus/linux` | 87.1 ± 12.8 | 48.3 | 95.2 | 2.76 ± 0.45 |
| `fd-unbounded -u --search-path bench/corpus/linux` | 31.5 ± 2.3 | 29.5 | 40.2 | 1.00 |

With colors

linux v6.5

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs bench/corpus/linux -color` | 208.7 ± 2.6 | 204.3 | 214.0 | 2.71 ± 0.09 |
| `fd -u --search-path bench/corpus/linux --color=always` | 185.3 ± 49.1 | 133.2 | 230.3 | 2.40 ± 0.64 |
| `fd-master -u --search-path bench/corpus/linux --color=always` | 95.9 ± 21.1 | 67.4 | 121.2 | 1.24 ± 0.28 |
| `fd-unbounded -u --search-path bench/corpus/linux --color=always` | 77.1 ± 2.4 | 74.0 | 81.6 | 1.00 |

Parallelism

rust 1.72.1

-j1

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs -j1 bench/corpus/rust -false` | 213.6 ± 0.5 | 212.8 | 214.4 | 1.00 |
| `fd -j1 -u '^$' bench/corpus/rust` | 277.2 ± 0.5 | 276.5 | 278.3 | 1.30 ± 0.00 |
| `fd-master -j1 -u '^$' bench/corpus/rust` | 283.4 ± 0.6 | 282.3 | 284.2 | 1.33 ± 0.00 |
| `fd-unbounded -j1 -u '^$' bench/corpus/rust` | 281.3 ± 0.5 | 280.6 | 282.1 | 1.32 ± 0.00 |

-j2

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs -j2 bench/corpus/rust -false` | 193.3 ± 1.0 | 191.5 | 195.2 | 1.24 ± 0.01 |
| `fd -j2 -u '^$' bench/corpus/rust` | 222.1 ± 5.3 | 216.5 | 231.8 | 1.42 ± 0.04 |
| `fd-master -j2 -u '^$' bench/corpus/rust` | 160.2 ± 1.5 | 158.3 | 162.7 | 1.03 ± 0.01 |
| `fd-unbounded -j2 -u '^$' bench/corpus/rust` | 155.9 ± 0.9 | 154.6 | 157.5 | 1.00 |

-j3

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs -j3 bench/corpus/rust -false` | 117.2 ± 6.2 | 108.7 | 125.6 | 1.05 ± 0.06 |
| `fd -j3 -u '^$' bench/corpus/rust` | 221.2 ± 2.5 | 217.1 | 223.7 | 1.99 ± 0.03 |
| `fd-master -j3 -u '^$' bench/corpus/rust` | 118.1 ± 2.4 | 112.9 | 121.0 | 1.06 ± 0.02 |
| `fd-unbounded -j3 -u '^$' bench/corpus/rust` | 111.2 ± 0.8 | 109.7 | 112.6 | 1.00 |

-j4

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs -j4 bench/corpus/rust -false` | 83.9 ± 4.1 | 77.1 | 89.6 | 1.00 |
| `fd -j4 -u '^$' bench/corpus/rust` | 231.4 ± 5.2 | 219.9 | 235.0 | 2.76 ± 0.15 |
| `fd-master -j4 -u '^$' bench/corpus/rust` | 95.4 ± 4.1 | 89.2 | 100.3 | 1.14 ± 0.07 |
| `fd-unbounded -j4 -u '^$' bench/corpus/rust` | 87.8 ± 1.1 | 85.7 | 89.5 | 1.05 ± 0.05 |

-j6

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs -j6 bench/corpus/rust -false` | 61.1 ± 1.4 | 58.1 | 63.8 | 1.00 |
| `fd -j6 -u '^$' bench/corpus/rust` | 230.6 ± 15.9 | 200.4 | 252.7 | 3.77 ± 0.27 |
| `fd-master -j6 -u '^$' bench/corpus/rust` | 74.0 ± 5.7 | 66.7 | 80.6 | 1.21 ± 0.10 |
| `fd-unbounded -j6 -u '^$' bench/corpus/rust` | 63.5 ± 0.9 | 61.8 | 64.8 | 1.04 ± 0.03 |

-j8

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs -j8 bench/corpus/rust -false` | 53.5 ± 2.2 | 50.1 | 57.4 | 1.04 ± 0.05 |
| `fd -j8 -u '^$' bench/corpus/rust` | 236.8 ± 13.2 | 222.3 | 259.0 | 4.61 ± 0.27 |
| `fd-master -j8 -u '^$' bench/corpus/rust` | 65.0 ± 7.5 | 57.0 | 73.2 | 1.27 ± 0.15 |
| `fd-unbounded -j8 -u '^$' bench/corpus/rust` | 51.4 ± 0.8 | 50.2 | 52.7 | 1.00 |

-j12

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs -j12 bench/corpus/rust -false` | 52.4 ± 2.4 | 47.9 | 57.0 | 1.27 ± 0.07 |
| `fd -j12 -u '^$' bench/corpus/rust` | 247.3 ± 13.7 | 230.0 | 268.8 | 5.99 ± 0.38 |
| `fd-master -j12 -u '^$' bench/corpus/rust` | 59.0 ± 12.0 | 46.9 | 73.3 | 1.43 ± 0.29 |
| `fd-unbounded -j12 -u '^$' bench/corpus/rust` | 41.3 ± 1.3 | 38.8 | 43.2 | 1.00 |

-j16

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `bfs -j16 bench/corpus/rust -false` | 73.1 ± 5.5 | 65.9 | 83.5 | 2.06 ± 0.17 |
| `fd -j16 -u '^$' bench/corpus/rust` | 273.2 ± 10.3 | 246.7 | 280.3 | 7.69 ± 0.41 |
| `fd-master -j16 -u '^$' bench/corpus/rust` | 77.9 ± 0.9 | 76.3 | 80.0 | 2.19 ± 0.09 |
| `fd-unbounded -j16 -u '^$' bench/corpus/rust` | 35.5 ± 1.3 | 33.2 | 38.1 | 1.00 |

Details

Versions

$ bfs --version | head -n1
bfs 3.0.4
$ find --version | head -n1
find (GNU findutils) 4.9.0
$ fd --version
fd 8.7.1
$ fd-master --version
fd 8.7.1
$ fd-unbounded --version
fd 8.7.1

tavianator (Collaborator, Author):

An extra benchmark to justify closing #1408:

$ hyperfine -w2 fd{,-{master,after}}" -u . /tmp/empty"
Benchmark 1: fd -u . /tmp/empty
  Time (mean ± σ):     148.9 ms ±  12.1 ms    [User: 7.5 ms, System: 142.8 ms]
  Range (min … max):   129.4 ms … 165.2 ms    18 runs
 
Benchmark 2: fd-master -u . /tmp/empty
  Time (mean ± σ):      73.8 ms ±  10.0 ms    [User: 4.8 ms, System: 72.4 ms]
  Range (min … max):    57.3 ms …  83.9 ms    34 runs
 
Benchmark 3: fd-after -u . /tmp/empty
  Time (mean ± σ):       5.2 ms ±   0.7 ms    [User: 1.6 ms, System: 7.8 ms]
  Range (min … max):     1.8 ms …   7.2 ms    268 runs
 
Summary
  fd-after -u . /tmp/empty ran
   14.22 ± 2.71 times faster than fd-master -u . /tmp/empty
   28.68 ± 4.50 times faster than fd -u . /tmp/empty

sharkdp mentioned this pull request on Nov 1, 2023
sharkdp (Owner) commented on Nov 1, 2023

Wow 🤩

I would like to understand this first before merging. The situation is the following: we have an MPSC scenario where the consumer is sometimes too slow to handle the incoming results (even though its only job is to print the paths to a console). In the past, we used an unbounded channel, which would lead to high memory usage because the channel was buffering all those results. Then we switched to a bounded channel, but that came at a high initialization cost, because bounded channels pre-allocate a fixed-size buffer for the messages. The messages are WorkerResults (size_of::<WorkerResult>() == 312). This fixed-size buffer presumably has a size of channel_size × message_size, i.e. 0x4000 × 312 bytes ≈ 4.9 MiB (?). And that was slowing us down by ~70 ms on your machine? Is it because each thread allocates those 5 MiB?

In this changeset, you switch back to an unbounded channel, but WorkerResults are recycled. I understand that this gets rid of the pre-allocation hit. But what about long searches (where initialization time can be neglected)? You seem to indicate that we are still faster in those situations? Couldn't this optimization be applied to bounded crossbeam channels in general (for large message sizes)? Is allocation/deallocation really so expensive that it is worth all of this overhead (creating a sender for each worker result, making an additional copy for each worker result when recycling, …)?

Don't get me wrong. I love that this works. I'm just a bit puzzled that this isn't a strategy that could be used to speed up bounded channels in general (or maybe it is?).

Would another allocator help? I'm not really knowledgeable here but it seems to me like this whole recycling-memory part should/could be the job of a (special-purpose) allocator?

tavianator (Collaborator, Author):

> Wow 🤩

:)

> I would like to understand this first before merging. The situation is the following: we have an MPSC scenario where the consumer is sometimes too slow to handle the incoming results (even though its only job is to print the paths to a console). In the past, we used an unbounded channel, which would lead to high memory usage because the channel was buffering all those results.

Exactly. It was actually fairly easy to cause this by stalling the receiver thread with something like `fd | less`.

> Then we switched to a bounded channel, but that came at a high initialization cost, because bounded channels pre-allocate a fixed-size buffer for the messages. The messages are WorkerResults (size_of::<WorkerResult>() == 312). This fixed-size buffer presumably has a size of channel_size × message_size, i.e. 0x4000 × 312 bytes ≈ 4.9 MiB (?).

Yeah, that times the number of threads. From fd/src/walk.rs, line 60 (at 15329f9):

```rust
let (tx, rx) = bounded(0x4000 * config.threads);
```
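For concreteness, here is the back-of-the-envelope math (the thread count is illustrative; fd defaults to roughly the number of CPUs):

```rust
fn main() {
    let threads = 8; // illustrative thread count
    let capacity = 0x4000 * threads; // 131,072 slots
    let slot_size = 312; // ≈ size_of::<WorkerResult>()
    let per_share = 0x4000 * slot_size; // 5,111,808 B ≈ 4.9 MiB
    let total = capacity * slot_size; // 40,894,464 B ≈ 39 MiB up front
    println!("{per_share} B per 0x4000 slots, {total} B total");
}
```

So the per-0x4000 share matches the ≈4.9 MiB estimate above, and the whole pre-allocated buffer is several times that.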

> And that was slowing us down by ~70 ms on your machine? Is it because each thread allocates those 5 MiB?

I'm not sure exactly why bounded() channels are that slow to initialize, but the allocation happens up-front, not in each thread, which is part of the problem.

> In this changeset, you switch back to an unbounded channel, but WorkerResults are recycled. I understand that this gets rid of the pre-allocation hit. But what about long searches (where initialization time can be neglected)? You seem to indicate that we are still faster in those situations?

Yeah, in my benchmarks it's universally faster. But I don't know the separate impacts of:

  • Switching to an unbounded channel
  • Shrinking WorkerResult to WorkerMsg
  • Re-using Box<WorkerResult> allocations

> Couldn't this optimization be applied to bounded crossbeam channels in general (for large message sizes)? Is allocation/deallocation really so expensive that it is worth all of this overhead (creating a sender for each worker result, making an additional copy for each worker result when recycling, …)?

Actually, Sender::clone() is fairly cheap; it just bumps an atomic refcount. I think it would even be possible to do

```rust
pub struct WorkerMsg<'a> {
    inner: Option<ResultBox>,
    tx: &'a Sender<ResultBox>,
}
```

but I'd have to plumb the lifetime through more stuff. WorkerState would have to own the Senders, I think.

> Don't get me wrong. I love that this works. I'm just a bit puzzled that this isn't a strategy that could be used to speed up bounded channels in general (or maybe it is?).

Maybe it's possible? Keep in mind these are somewhat different semantics, because each thread is strictly limited to its own pool of 0x4000 WorkerResults. In the previous implementation, the capacity was shared between threads.

> Would another allocator help? I'm not really knowledgeable here but it seems to me like this whole recycling-memory part should/could be the job of a (special-purpose) allocator?

It's possible that https://github.com/microsoft/snmalloc would have some of the same benefits. But here's the thing: this approach needs the receiver thread to somehow tell the sender thread that it has handled a WorkerResult. My first thought was to use a semaphore to limit the number of WorkerResults allocated. But a single semaphore doesn't scale very well, so then I figured I could have a separate semaphore for each sender.

But actually, an SPSC channel scales better than a semaphore! That's because the head/tail index can be on separate cache lines. For a semaphore, both threads are always contending on the same counter. So for that reason I just went with the channel implementation (and because we already had it available; I'd have to go find/write a semaphore otherwise).
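To illustrate the cache-line point, here is a sketch of mine (not code from this PR): with a counting semaphore, both threads do read-modify-writes on the same atomic, while a ring buffer can keep the producer's and consumer's counters on separate cache lines:

```rust
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::{AtomicUsize, Ordering};

// A counting semaphore: acquire and release both RMW the same counter,
// so the sender and receiver bounce one cache line between cores.
struct Semaphore {
    permits: AtomicUsize,
}

impl Semaphore {
    fn release(&self) {
        self.permits.fetch_add(1, Ordering::Release);
    }
    fn try_acquire(&self) -> bool {
        self.permits
            .fetch_update(Ordering::Acquire, Ordering::Relaxed, |p| p.checked_sub(1))
            .is_ok()
    }
}

// An SPSC ring buffer can pad head and tail onto separate cache lines:
// the producer writes only `tail`, the consumer writes only `head`.
#[repr(align(64))]
struct CachePadded(AtomicUsize);

struct SpscRing<T> {
    head: CachePadded, // consumer-owned
    tail: CachePadded, // producer-owned
    buf: Box<[UnsafeCell<MaybeUninit<T>>]>,
}
```

In the common case each side of the ring buffer touches only its own counter, so the threads rarely contend at all.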

tavianator (Collaborator, Author):

> I think it would even be possible to do
>
> ```rust
> pub struct WorkerMsg<'a> {
>     inner: Option<ResultBox>,
>     tx: &'a Sender<ResultBox>,
> }
> ```
>
> but I'd have to plumb the lifetime through more stuff. WorkerState would have to own the Senders, I think.

Update: I just tried a hacky version of this, and it wasn't appreciably faster.

tavianator marked this pull request as draft on November 1, 2023 at 20:52
tmccombs (Collaborator) commented on Nov 2, 2023

> Re-using Box<WorkerResult> allocations

I suspect this is a pretty big component of it, unless the worker generates results faster than the receiver can process them and we hit the maximum amount of allocation.

It probably also helps that allocation for the channels now happens on the individual threads instead of in the spawning thread.

Looking at the code for crossbeam, it isn't just that we have to allocate the memory for the channel up front; we also have to initialize it, and that can't be done with memset or equivalent.

> I'm just a bit puzzled that this isn't a strategy that could be used to speed up bounded channels in general (or maybe it is?)

I think it could be used in general: basically, implement a bounded channel as two unbounded channels plus an atomic counter for the number of items allocated, and have the receiver pass the slot back after reading the value out of it. I'm not sure that would universally improve performance, but it does have a couple of advantages: memory is allocated lazily, and it would be possible to dynamically change the size of the bound.
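A hypothetical sketch of that idea (the name `LazyBounded` and all details are mine, not a real crossbeam API):

```rust
use crossbeam_channel::{Receiver, Sender};
use std::sync::atomic::{AtomicUsize, Ordering};

/// A "bounded" channel built from two unbounded channels: one carrying
/// data, one returning empty slots. Slots are allocated lazily up to
/// `bound`, and the bound could even be changed at runtime.
struct LazyBounded<T> {
    data_tx: Sender<Box<Option<T>>>,
    data_rx: Receiver<Box<Option<T>>>,
    free_tx: Sender<Box<Option<T>>>,
    free_rx: Receiver<Box<Option<T>>>,
    live: AtomicUsize,
    bound: usize,
}

impl<T> LazyBounded<T> {
    fn send(&self, value: T) {
        let slot = match self.free_rx.try_recv() {
            // Reuse a slot the receiver handed back.
            Ok(mut slot) => { *slot = Some(value); slot }
            Err(_) => {
                // Lazily allocate while under the bound...
                if self.live.fetch_add(1, Ordering::Relaxed) < self.bound {
                    Box::new(Some(value))
                } else {
                    // ...otherwise undo and block until a slot returns.
                    self.live.fetch_sub(1, Ordering::Relaxed);
                    let mut slot = self.free_rx.recv().unwrap();
                    *slot = Some(value);
                    slot
                }
            }
        };
        self.data_tx.send(slot).unwrap();
    }

    fn recv(&self) -> T {
        let mut slot = self.data_rx.recv().unwrap();
        let value = slot.take().unwrap();
        // Pass the now-empty slot back for reuse.
        let _ = self.free_tx.send(slot);
        value
    }
}
```

This keeps the bounded channel's backpressure while paying for slots only as they are actually used.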

tavianator force-pushed the unbounded branch 2 times, most recently from 76e8437 to 0fbbaae, on November 2, 2023
tavianator marked this pull request as ready for review on November 2, 2023
tavianator (Collaborator, Author):

> Looking at the code for crossbeam, it isn't just that we have to allocate the memory for the channel up front; we also have to initialize it, and that can't be done with memset or equivalent.

Well, it could be done with memset() if they changed their representation slightly, e.g.:

```rust
/// A slot in a channel.
struct Slot<T> {
    /// The current stamp.
    stamp: AtomicUsize,
    /// The message in this slot.
    msg: UnsafeCell<MaybeUninit<T>>,
}

impl<T> Slot<T> {
    fn new() -> Self {
        Self {
            stamp: AtomicUsize::new(0),
            msg: UnsafeCell::new(MaybeUninit::zeroed()),
        }
    }
}

// ...

let slot = unsafe { self.buffer.get_unchecked(index) };
let stamp = slot.stamp.load(Ordering::Acquire) + index;
```

And if they did that, they could allocate the whole buffer with mmap() and get lazily-initialized zero pages. I use a similar trick in bfs to initialize a whole linked list from zeroed memory: https://github.com/tavianator/bfs/blob/b2ab7a151fca517f4879e76e626ec85ad3de97c7/src/alloc.c#L63-L88

sharkdp (Owner) commented on Nov 2, 2023

I did some benchmarks comparing current master (8bbbd76) with this branch (d588971).

I can confirm the huge increase in startup speed (`hyperfine -w5 -N -L version master,1414 "./fd-{version} -u . /tmp/empty" --export-markdown -`):

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `./fd-master -u . /tmp/empty` | 25.4 ± 10.0 | 18.1 | 54.7 | 7.24 ± 3.34 |
| `./fd-1414 -u . /tmp/empty` | 3.5 ± 0.9 | 2.6 | 9.5 | 1.00 |

Unfortunately, there seems to be a large regression for longer searches with many results (`hyperfine -w5 -N -L version master,1414 "./fd-{version} -u . /folder/with/3M/files" --export-markdown -`):

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `./fd-master -u . /home/ped1st/workspace` | 841.0 ± 125.2 | 737.2 | 1069.5 | 1.00 |
| `./fd-1414 -u . /home/ped1st/workspace` | 1386.5 ± 18.0 | 1358.8 | 1411.5 | 1.65 ± 0.25 |

Edit: a quick perf profile seems to indicate that the additional sending back of messages (?) could be the issue here:

[perf profile screenshot omitted]

tavianator (Collaborator, Author):

Indeed... most of my benchmarks use the pattern `^$`, which never matches, so nothing ever gets sent over the channels! If I print paths, then I can reproduce it:

tavianator@tachyon $ hyperfine -w2 fd{,-{master,unbounded}}" -u --search-path ~/code/bfs/bench/corpus/chromium"
Benchmark 1: fd -u --search-path ~/code/bfs/bench/corpus/chromium
  Time (mean ± σ):      2.570 s ±  0.013 s    [User: 10.525 s, System: 99.826 s]
  Range (min … max):    2.550 s …  2.590 s    10 runs
 
Benchmark 2: fd-master -u --search-path ~/code/bfs/bench/corpus/chromium
  Time (mean ± σ):     796.4 ms ±  15.2 ms    [User: 11590.2 ms, System: 3956.6 ms]
  Range (min … max):   773.4 ms … 820.2 ms    10 runs
 
Benchmark 3: fd-unbounded -u --search-path ~/code/bfs/bench/corpus/chromium
  Time (mean ± σ):      1.065 s ±  0.065 s    [User: 7.692 s, System: 5.038 s]
  Range (min … max):    0.986 s …  1.162 s    10 runs
 
Summary
  fd-master -u --search-path ~/code/bfs/bench/corpus/chromium ran
    1.34 ± 0.09 times faster than fd-unbounded -u --search-path ~/code/bfs/bench/corpus/chromium
    3.23 ± 0.06 times faster than fd -u --search-path ~/code/bfs/bench/corpus/chromium

Let me try a non-hacky version of #1414 (comment) ...

tavianator (Collaborator, Author):

I just pushed a couple more commits with different strategies, including passing senders by reference and my own little semaphore implementation. Unfortunately, it seems like most of the overhead is actually from the unbounded channel implementation itself, i.e. initialization time is faster but sending is slower. Let me try a different approach.

tmccombs (Collaborator) commented on Nov 2, 2023

I wonder if it would be worthwhile to try to reduce the initialization overhead in crossbeam channel.

tmccombs (Collaborator) commented on Nov 4, 2023

A couple of other ideas:

  • We could reduce the size of the channel, which would speed up initialization but might hurt performance if the bound is the bottleneck.

  • Instead of having a single channel, we could have each sender thread create its own channel, which it passes back to the main thread via another channel. Then in the receiver we either select from all the channels, or, if we spawn multiple receiver threads, each receiver could process a single input channel.

tavianator (Collaborator, Author):

> A couple of other ideas:
>
> We could reduce the size of the channel, which would speed up initialization but might hurt performance if the bound is the bottleneck.

I've tried this before; it's usually a perf loss.

> Instead of having a single channel, we could have each sender thread create its own channel, which it passes back to the main thread via another channel.
>
> Then in the receiver we either select from all the channels, or, if we spawn multiple receiver threads, each receiver could process a single input channel.

This was the "different approach" I mentioned above. It's much slower.

But I just tried another idea: instead of sending individual WorkerResults, send something like Arc<Mutex<Option<Vec<WorkerResult>>>> over the channel. The senders keep adding to the same batch whenever they can, and the receiver thread drains whole batches at once. It's tremendously faster than everything else I've tried:

Benchmark 1: bfs bench/corpus/chromium
  Time (mean ± σ):     679.5 ms ±  23.1 ms    [User: 967.2 ms, System: 3165.0 ms]
  Range (min … max):   648.4 ms … 710.3 ms    10 runs

Benchmark 2: find bench/corpus/chromium
  Time (mean ± σ):      3.213 s ±  0.020 s    [User: 0.559 s, System: 2.602 s]
  Range (min … max):    3.181 s …  3.245 s    10 runs

Benchmark 3: fd -u --search-path bench/corpus/chromium
  Time (mean ± σ):      2.637 s ±  0.011 s    [User: 10.695 s, System: 99.795 s]
  Range (min … max):    2.621 s …  2.654 s    10 runs

Benchmark 4: fd-master -u --search-path bench/corpus/chromium
  Time (mean ± σ):     767.8 ms ±  36.4 ms    [User: 10885.6 ms, System: 3892.2 ms]
  Range (min … max):   725.8 ms … 817.7 ms    10 runs

Benchmark 5: fd-batch -u --search-path bench/corpus/chromium
  Time (mean ± σ):     293.2 ms ±   6.7 ms    [User: 2767.8 ms, System: 3095.5 ms]
  Range (min … max):   282.3 ms … 302.3 ms    10 runs

Summary
  fd-batch -u --search-path bench/corpus/chromium ran
    2.32 ± 0.09 times faster than bfs bench/corpus/chromium
    2.62 ± 0.14 times faster than fd-master -u --search-path bench/corpus/chromium
    8.99 ± 0.21 times faster than fd -u --search-path bench/corpus/chromium
   10.96 ± 0.26 times faster than find bench/corpus/chromium

I'll put up a PR for it in a bit.
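For reference, a minimal sketch of what that batching could look like, based on my reading of the description above (the actual PR may structure it differently; `BatchSender` and the placeholder `WorkerResult` are illustrative):

```rust
use crossbeam_channel::{Receiver, Sender};
use std::sync::{Arc, Mutex};

struct WorkerResult { /* path, metadata, ... */ } // illustrative stand-in

type Batch = Arc<Mutex<Option<Vec<WorkerResult>>>>;

struct BatchSender {
    batch: Batch, // starts as Arc::new(Mutex::new(None))
    tx: Sender<Batch>,
}

impl BatchSender {
    fn send(&self, result: WorkerResult) {
        let mut guard = self.batch.lock().unwrap();
        match guard.as_mut() {
            // The receiver hasn't drained this batch yet: just append.
            Some(batch) => batch.push(result),
            // The receiver took the batch: refill it and announce the
            // Arc on the channel again so the receiver revisits it.
            None => {
                *guard = Some(vec![result]);
                drop(guard);
                self.tx.send(Arc::clone(&self.batch)).unwrap();
            }
        }
    }
}

fn receive(rx: Receiver<Batch>) {
    for batch in rx {
        // One lock acquisition drains the whole batch.
        let results = batch.lock().unwrap().take().unwrap_or_default();
        for _result in results {
            // ... print the path ...
        }
    }
}
```

The effect is that channel traffic and lock acquisitions are amortized over whole Vecs of results instead of being paid per result, which fits the benchmark numbers above.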

Successfully merging this pull request may close these issues:

  • Improve startup time
  • Excessive memory usage on huge trees