Performance Ideas
Current state: after the regex→glob migration, inline entry processing, skipping gitignore loading in `.All` mode, and channel-based streaming output, findr beats fd in 3 of 4 cases.
Benchmark results (2026-06-17, post-channels)
| Case | fd | findr | Ratio (findr vs fd) |
|---|---|---|---|
| 1 `-E .jj` | 159ms | 112ms | 1.42x faster |
| 2 `-H` | 1.202s | 710ms | 1.69x faster |
| 3 `-HI` | 1.080s | 1.212s | 1.12x slower |
| 4 `-E .git` | 298ms | 222ms | 1.34x faster |
Channels gave the biggest single improvement since the project started. Cases 1, 2, and 4 got dramatically faster because output I/O now overlaps with directory walking. Case 3 improved from 1.18x slower to 1.12x slower.
Completed
- Per-thread result buffers: each thread accumulates locally, merges once at exit. Eliminates per-result mutex contention.
- Lean path join: `join_path`/`join_path_dir` use stack buffer + `copy` + single alloc instead of `strings.Builder` + `fmt.sbprintf` + `clone`.
- Regex→glob migration: replaced regex NFA with backtracking glob matcher. Eliminated 27% of CPU spent on `add_thread`/`is_ignored`. Biggest win.
- 32KB getdents buffer: bumped from 8KB. Marginal improvement, within noise.
- Skip gitignore loading in `.All` mode: eliminated thousands of unnecessary file opens/parses in `-HI`. Cut system time 34% (12.4s → 8.2s).
- Fixed-size threads slice: replaced `[dynamic]^thread.Thread` with `[]^thread.Thread` since thread count is known upfront.
- Inline entry processing: merged `read_dir_entries` into `process_dir`. Entry names consumed directly from getdents buffer via `dirent_name(d)` views. Eliminated millions of `strings.clone`/`delete` pairs. User time dropped 38% in `-HI` case.
- Skip `has_git_dir` probe in `.All` mode: guarded `has_git_dir(fd)` with `ignore_mode != .All`. Eliminated ~280K wasted `openat` ENOENT probes in `-HI` case. System time dropped 33% (11.3s → 7.6s).
- Channel-based streaming output: replaced global results array + mutex with `chan.Chan([]string)`, cap `2 * thread_count`. Workers flush 256-result batches through the channel; a consumer thread drains to stdout. Matches fd's architecture (`crossbeam_channel::bounded(2*threads)`, batch size `0x100`). Eliminates the collect-then-write barrier. Cases 1/2/4 went from 1.1-1.3x faster to 1.3-1.7x faster.
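The batching scheme in the last item can be illustrated with a small sketch. It is written in Rust with only the standard library, not in Odin, and it is not findr's code: `worker`, `run`, and the fake `dirN/fileM` paths are hypothetical stand-ins, and `std::sync::mpsc::sync_channel` is a mutex-free-API stand-in for the bounded channel, not `core:sync/chan` or `crossbeam_channel`. The consumer here only counts batches on the calling thread where a real one would write to stdout.

```rust
use std::sync::mpsc::{sync_channel, SyncSender};
use std::thread;

/// Results per batch; the notes above cite 256 (0x100).
const BATCH_SIZE: usize = 256;

/// Hypothetical worker: generates `count` fake paths and ships them in
/// batches so the channel is touched once per 256 results, not once per result.
fn worker(id: usize, count: usize, tx: SyncSender<Vec<String>>) {
    let mut batch = Vec::with_capacity(BATCH_SIZE);
    for i in 0..count {
        batch.push(format!("dir{}/file{}", id, i));
        if batch.len() == BATCH_SIZE {
            // Detach the full batch and start a fresh one.
            let full = std::mem::replace(&mut batch, Vec::with_capacity(BATCH_SIZE));
            tx.send(full).expect("consumer hung up");
        }
    }
    // Flush the final partial batch, if any.
    if !batch.is_empty() {
        tx.send(batch).expect("consumer hung up");
    }
}

/// Spawns `thread_count` workers and drains the channel on the calling
/// thread. Returns (number of batches, number of results) received.
fn run(thread_count: usize, per_worker: usize) -> (usize, usize) {
    // Bounded capacity of 2 * thread_count, as in the notes above.
    let (tx, rx) = sync_channel::<Vec<String>>(2 * thread_count);
    let handles: Vec<_> = (0..thread_count)
        .map(|id| {
            let tx = tx.clone();
            thread::spawn(move || worker(id, per_worker, tx))
        })
        .collect();
    // Drop the original sender so the receive loop ends once all workers exit.
    drop(tx);

    let mut batches = 0usize;
    let mut results = 0usize;
    for batch in rx {
        // A real consumer would write each path to stdout here.
        batches += 1;
        results += batch.len();
    }
    for h in handles {
        h.join().expect("worker panicked");
    }
    (batches, results)
}
```

Because the consumer starts draining as soon as the first batch arrives, output overlaps with walking; the bounded capacity makes workers block in `send` when the consumer falls behind instead of letting results pile up.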
fd vs findr architecture comparison
| Aspect | fd (ignore crate) | findr |
|---|---|---|
| Syscall | `libc::readdir` | raw `getdents64` |
| Entry names | Clones into owned `PathBuf` per entry | Zero-copy view from getdents buffer |
| `.git` detection | `stat(".git")` per directory | `openat(fd, ".git")` probe per directory |
| Gitignore setup | Before entry iteration | Before entry iteration |
| Path traversal | Full paths | Full paths |
| Glob matching | globset stratification (literals→hash, complex→regex) | Backtracking token matcher |
| Result transport | `crossbeam_channel::bounded(2*threads)` (lock-free MPMC) | `core:sync/chan` (single-mutex ring buffer) |
| Batching | `Arc<Mutex<Option<Vec>>>` shared buffer, flush on first item | Detach backing array as `[]string`, flush when full (256) |
| Output mode | Hybrid: buffer 1000 items / 100ms → sort → stream | Direct streaming (no buffer/sort mode yet) |
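To make the "backtracking" entry in the glob-matching row concrete, here is a minimal byte-level sketch in Rust. It is an assumption-laden illustration, not findr's token matcher: `glob_match` is a hypothetical name, only `*` and `?` are supported, and `*` is allowed to cross `/`, which a gitignore-style matcher would normally restrict.

```rust
/// Returns true if `text` matches `pattern`, where `*` matches any run of
/// bytes (including none) and `?` matches exactly one byte.
fn glob_match(pattern: &str, text: &str) -> bool {
    let p = pattern.as_bytes();
    let t = text.as_bytes();
    let mut pi = 0usize; // index into the pattern
    let mut ti = 0usize; // index into the text
    // Most recent `*`: (its pattern index, text index it has absorbed up to).
    let mut star: Option<(usize, usize)> = None;

    while ti < t.len() {
        if pi < p.len() && p[pi] == b'*' {
            // Record the star and first try matching it against nothing.
            star = Some((pi, ti));
            pi += 1;
        } else if pi < p.len() && (p[pi] == b'?' || p[pi] == t[ti]) {
            // Literal or single-byte wildcard match: advance both sides.
            pi += 1;
            ti += 1;
        } else if let Some((sp, st)) = star {
            // Mismatch: backtrack so the last `*` absorbs one more byte.
            pi = sp + 1;
            ti = st + 1;
            star = Some((sp, st + 1));
        } else {
            // Mismatch with no `*` to fall back on.
            return false;
        }
    }
    // Text is exhausted; any remaining pattern must be all `*`.
    while pi < p.len() && p[pi] == b'*' {
        pi += 1;
    }
    pi == p.len()
}
```

Only the most recent `*` is ever revisited, so the matcher needs no allocation and no compiled automaton, which is the property that makes this style cheap for short ignore patterns.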
Known problems
- Allocator efficiency gap: findr still allocates 1-3 heap strings per entry (`join_path` results, work item paths). fd does the same but benefits from Rust's allocator. Odin's default allocator may have higher per-allocation overhead.
- Channel mutex contention (unconfirmed): Odin's `core:sync/chan` uses a single mutex for the entire ring buffer. With 16 senders + 1 receiver hitting the same lock, every `chan.send`/`chan.recv` is a potential futex contention point. fd uses `crossbeam_channel::bounded`, which is lock-free MPMC. Note: early spall profiles showed 11.8% futex_wait, but this was likely a profiling artifact: the channel ops generate more instrumentation events, causing the 1GB spall cap to be hit over a longer wall-time window (3.5s vs 1s), skewing the profile. Needs a fair comparison (smaller tree or larger cap) to confirm whether this is real.
Remaining ideas
- Lock-free MPMC queue: Replace Odin's mutex-based channel with a custom multi-producer-single-consumer ring buffer using atomics. Eliminates all futex syscalls on the result-transport hot path.

  Design:
  - Fixed-capacity ring buffer of `[]string` slots (cap = `2 * thread_count`, same as now)
  - Producer side: each worker atomic-CASes a `head` counter forward to claim a slot index, writes its batch, then sets a `ready` flag on the slot
  - Consumer side: atomic-load `head`, drains all ready slots up to `head`, writes to stdout, frees batches
  - Backpressure: if `head - tail >= cap`, producer spins/waits (yields via `sched_yield` or `futex` with private flag)
  - Close: atomic flag set by `walk_stream` after all workers joined; consumer drains remaining then exits

  Alternative: Use a per-producer SPSC queue (one ring per worker thread). Consumer round-robins across all N queues. No CAS on producer side: each worker writes to its own queue with only a `store` + fence. Consumer reads from each with a `load`. Trades simplicity for zero contention.

  Risk: Low. The API surface is small (`send`, `recv`, `close`). Can be swapped behind the existing `flush_batch` interface without touching `walk_worker` or `output_writer`. fd's `crossbeam_channel` proves lock-free MPMC is achievable.

  Effort: Medium. ~100-150 lines for the queue + a few tests. No changes to walker or main.
- Arena allocator per thread: Bump allocator for all transient strings (result paths, work item paths), free once at exit. Would address the allocator efficiency gap. Bigger change, helps everywhere.
- Buffer/sort output mode (fd's approach): Buffer up to 1000 results (or 100ms deadline), sort them, then switch to streaming. Gives sorted output for small searches without sacrificing throughput on large ones. fd's `ReceiverMode::Buffering → Streaming` pattern.
- Git index parsing: Parse `.git/index` binary format to show tracked dotfiles. Closes the 84-file correctness delta in cases 1/4. Last correctness gap.
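The per-producer SPSC alternative from the lock-free queue idea can be sketched as follows. This is a rough Rust illustration under stated assumptions, not a port of any existing findr code: `Ring`, `Producer`, `Consumer`, `spsc`, `try_send`, and `try_recv` are hypothetical names, and a null pointer in a slot plays the role of the "not ready" flag. It is only correct with exactly one producer thread and one consumer thread per ring.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};
use std::sync::Arc;

type Batch = Vec<String>;

/// Fixed-capacity ring; a null slot is empty, a non-null slot holds a batch.
struct Ring {
    slots: Vec<AtomicPtr<Batch>>,
}

impl Drop for Ring {
    fn drop(&mut self) {
        // Free batches that were sent but never received.
        for slot in self.slots.iter_mut() {
            let p = *slot.get_mut();
            if !p.is_null() {
                unsafe { drop(Box::from_raw(p)) };
            }
        }
    }
}

/// Each side keeps its own private cursor, so no index is shared.
struct Producer { ring: Arc<Ring>, next: usize }
struct Consumer { ring: Arc<Ring>, next: usize }

fn spsc(cap: usize) -> (Producer, Consumer) {
    assert!(cap > 0, "capacity must be non-zero");
    let slots = (0..cap).map(|_| AtomicPtr::new(ptr::null_mut())).collect();
    let ring = Arc::new(Ring { slots });
    (Producer { ring: ring.clone(), next: 0 }, Consumer { ring, next: 0 })
}

impl Producer {
    /// Publishes a batch with one load and one store (no CAS). Hands the
    /// batch back if the ring is full so the caller can yield and retry.
    fn try_send(&mut self, batch: Batch) -> Result<(), Batch> {
        let slot = &self.ring.slots[self.next];
        if !slot.load(Ordering::Acquire).is_null() {
            return Err(batch); // consumer has not drained this slot yet
        }
        // Release makes the batch contents visible before the pointer.
        slot.store(Box::into_raw(Box::new(batch)), Ordering::Release);
        self.next = (self.next + 1) % self.ring.slots.len();
        Ok(())
    }
}

impl Consumer {
    /// Takes the next batch in FIFO order, or None if the slot is empty.
    fn try_recv(&mut self) -> Option<Batch> {
        let slot = &self.ring.slots[self.next];
        let p = slot.load(Ordering::Acquire);
        if p.is_null() {
            return None;
        }
        // Only the consumer clears slots, so a plain store is enough.
        slot.store(ptr::null_mut(), Ordering::Release);
        self.next = (self.next + 1) % self.ring.slots.len();
        // Safety: `p` came from Box::into_raw in try_send and is taken once.
        let batch = unsafe { *Box::from_raw(p) };
        Some(batch)
    }
}
```

A consumer for N workers would hold N `Consumer` handles and call `try_recv` on each in turn, yielding when a full pass finds nothing. Close detection and blocking backpressure are left out of this sketch.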