Performance Ideas

Current state after regex→glob migration + inline entry processing + skip gitignore in .All mode + channel-based streaming output. findr beats fd in 3/4 cases.

Benchmark results (2026-06-17, post-channels)

Case	fd	findr	Ratio
1 `-E .jj`	159ms	112ms	1.42x faster
2 `-H`	1.202s	710ms	1.69x faster
3 `-HI`	1.080s	1.212s	1.12x slower
4 `-E .git`	298ms	222ms	1.34x faster

Channels gave the biggest single improvement since the project started. Cases 1, 2, and 4 got dramatically faster because output I/O now overlaps with directory walking. Case 3 improved from 1.18x slower to 1.12x slower.

Completed

Per-thread result buffers — each thread accumulates locally, merges once at exit. Eliminates per-result mutex contention.
Lean path join — join_path/join_path_dir use stack buffer + copy + single alloc instead of strings.Builder + fmt.sbprintf + clone.
Regex→glob migration — replaced regex NFA with backtracking glob matcher. Eliminated 27% of CPU spent on add_thread/is_ignored. Biggest win.
32KB getdents buffer — bumped from 8KB. Marginal improvement, within noise.
Skip gitignore loading in .All mode — eliminated thousands of unnecessary file opens/parses in -HI. Cut system time 34% (12.4s → 8.2s).
Fixed-size threads slice — replaced [dynamic]^thread.Thread with []^thread.Thread since thread count is known upfront.
Inline entry processing — merged read_dir_entries into process_dir. Entry names consumed directly from getdents buffer via dirent_name(d) views. Eliminated millions of strings.clone/delete pairs. User time dropped 38% in -HI case.
Skip has_git_dir probe in .All mode — guarded has_git_dir(fd) with ignore_mode != .All. Eliminated ~280K wasted openat ENOENT probes in -HI case. System time dropped 33% (11.3s → 7.6s).
Channel-based streaming output — replaced global results array + mutex with chan.Chan([]string), cap 2 * thread_count. Workers flush 256-result batches through the channel; a consumer thread drains to stdout. Matches fd's architecture (crossbeam_channel::bounded(2*threads), batch size 0x100). Eliminates the collect-then-write barrier. Cases 1/2/4 went from 1.1-1.3x faster to 1.3-1.7x faster.

fd vs findr architecture comparison

Aspect	fd (ignore crate)	findr
Syscall	`libc::readdir`	raw `getdents64`
Entry names	Clones into owned `PathBuf` per entry	Zero-copy view from getdents buffer
`.git` detection	`stat(".git")` per directory	`openat(fd, ".git")` probe per directory
Gitignore setup	Before entry iteration	Before entry iteration
Path traversal	Full paths	Full paths
Glob matching	globset stratification (literals→hash, complex→regex)	Backtracking token matcher
Result transport	`crossbeam_channel::bounded(2*threads)` (lock-free MPMC)	`core:sync/chan` (single-mutex ring buffer)
Batching	`Arc<Mutex<Option<Vec>>>` shared buffer, flush on first item	Detach backing array as `[]string`, flush when full (256)
Output mode	Hybrid: buffer 1000 items / 100ms → sort → stream	Direct streaming (no buffer/sort mode yet)

Known problems

Allocator efficiency gap — findr still allocates 1-3 heap strings per entry (join_path results, work item paths). fd does the same but benefits from Rust's allocator. Odin's default allocator may have higher per-allocation overhead.
Channel mutex contention (unconfirmed) — Odin's core:sync/chan uses a single mutex for the entire ring buffer. With 16 senders + 1 receiver hitting the same lock, every chan.send/chan.recv is a potential futex contention point. fd uses crossbeam_channel::bounded which is lock-free MPMC. Note: early spall profiles showed 11.8% futex_wait, but this was likely a profiling artifact — the channel ops generate more instrumentation events, causing the 1GB spall cap to be hit over a longer wall-time window (3.5s vs 1s), skewing the profile. Needs a fair comparison (smaller tree or larger cap) to confirm whether this is real.

Remaining ideas

Lock-free MPMC queue Replace Odin's mutex-based channel with a custom multi-producer-single-consumer ring buffer using atomics. Eliminates all futex syscalls on the result-transport hot path.

Design:
- Fixed-capacity ring buffer of []string slots (cap = 2 * thread_count, same as now)
- Producer side: each worker atomic-CASes a head counter forward to claim a slot index, writes its batch, then sets a ready flag on the slot
- Consumer side: atomic-load head, drains all ready slots up to head, writes to stdout, frees batches
- Backpressure: if head - tail >= cap, producer spins/waits (yields via sched_yield or futex with private flag)
- Close: atomic flag set by walk_stream after all workers joined; consumer drains remaining then exits
Alternative: Use a per-producer SPSC queue (one ring per worker thread). Consumer round-robins across all N queues. No CAS on producer side — each worker writes to its own queue with only a store + fence. Consumer reads from each with a load. Trades simplicity for zero contention.

Risk: Low. The API surface is small (send, recv, close). Can be swapped behind the existing flush_batch interface without touching walk_worker or output_writer. fd's crossbeam_channel proves lock-free MPMC is achievable.

Effort: Medium. ~100-150 lines for the queue + a few tests. No changes to walker or main.
Arena allocator per thread Bump allocator for all transient strings (result paths, work item paths), free once at exit. Would address the allocator efficiency gap. Bigger change, helps everywhere.
Buffer/sort output mode (fd's approach) Buffer up to 1000 results (or 100ms deadline), sort them, then switch to streaming. Gives sorted output for small searches without sacrificing throughput on large ones. fd's ReceiverMode::Buffering → Streaming pattern.
Git index parsing Parse .git/index binary format to show tracked dotfiles. Closes the 84-file correctness delta in cases 1/4. Last correctness gap.

6.2 KiB Raw Blame History