Performance Ideas
Current state after regex→glob migration + 32KB getdents + skip gitignore in .All mode + inline entry processing. findr beats fd in 3/4 cases.
Benchmark results (2026-06-17, post-inline-processing)
| Case | fd | findr | Ratio |
|---|---|---|---|
| 1 `-E .jj` | 187ms | 150ms | 1.25x faster |
| 2 `-H` | 1.242s | 1.136s | 1.09x faster |
| 3 `-HI` | 1.708s | 1.612s | 1.06x slower |
| 4 `-E .git` | 306ms | 242ms | 1.26x faster |
Case 3 (-HI) wall time is now close to parity. User time dropped 38% (6.9s → 4.3s) from eliminating entry name clones, but system time rose 38% (8.2s → 11.3s) from the openat(".git") probe overhead.
Completed
- Per-thread result buffers: each thread accumulates locally, merges once at exit. Eliminates per-result mutex contention.
- Lean path join: `join_path`/`join_path_dir` use stack buffer + `copy` + single alloc instead of `strings.Builder` + `fmt.sbprintf` + `clone`.
- Regex→glob migration: replaced regex NFA with backtracking glob matcher. Eliminated 27% of CPU spent on `add_thread`/`is_ignored`. Biggest win.
- 32KB getdents buffer: bumped from 8KB. Marginal improvement, within noise.
- Skip gitignore loading in `.All` mode: eliminated thousands of unnecessary file opens/parses in `-HI`. Cut system time 34% (12.4s → 8.2s).
- Fixed-size threads slice: replaced `[dynamic]^thread.Thread` with `[]^thread.Thread` since thread count is known upfront.
- Inline entry processing: merged `read_dir_entries` into `process_dir`. Entry names consumed directly from getdents buffer via `dirent_name(d)` views. Eliminated millions of `strings.clone`/`delete` pairs. User time dropped 38% in `-HI` case.
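
The "lean path join" item can be illustrated with a minimal sketch. It is written in C rather than Odin, and this `join_path` signature is an assumption made for the example, not findr's actual procedure. Lengths are passed explicitly so that `name` can be a view into a getdents buffer, and the result is built with exactly one allocation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical C analogue of a lean path join: one heap allocation,
 * two memcpy calls, no intermediate builder or formatted print.
 * Returns NULL on allocation failure; the caller frees the result. */
char *join_path(const char *dir, size_t dir_len,
                const char *name, size_t name_len) {
    assert(dir != NULL && name != NULL);

    /* Avoid a doubled separator when dir already ends in '/'. */
    size_t need_sep = (dir_len > 0 && dir[dir_len - 1] != '/') ? 1 : 0;
    size_t total = dir_len + need_sep + name_len;

    char *out = malloc(total + 1);
    if (out == NULL) {
        return NULL;
    }
    memcpy(out, dir, dir_len);
    if (need_sep) {
        out[dir_len] = '/';
    }
    memcpy(out + dir_len + need_sep, name, name_len);
    out[total] = '\0';
    return out;
}
```

Because the final length is known before allocating, the C version does not need a stack staging buffer; the point carried over from the list above is the single allocation per joined path.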
fd vs findr architecture comparison
| Aspect | fd (ignore crate) | findr |
|---|---|---|
| Syscall | `libc::readdir` | raw `getdents64` |
| Entry names | Clones into owned `PathBuf` per entry | Zero-copy view from getdents buffer |
| `.git` detection | `stat(".git")` per directory | `openat(fd, ".git")` probe per directory |
| Gitignore setup | Before entry iteration | Before entry iteration |
| Path traversal | Full paths | Full paths |
| Glob matching | globset stratification (literals→hash, complex→regex) | Backtracking token matcher |
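
To make the "backtracking" row concrete, here is a minimal sketch of a backtracking glob matcher in C. It is a simplified illustration, not findr's token matcher: it works on raw characters, supports only `*` and `?`, and lets `*` match `/`, all of which are assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Matches text against a pattern where '*' matches any run of characters
 * (including none) and '?' matches exactly one character.
 * On a mismatch it backtracks to the most recent '*' and lets that star
 * absorb one more character of text. */
bool glob_match(const char *pat, const char *text) {
    assert(pat != NULL && text != NULL);

    const char *star = NULL; /* position of the most recent '*' in pat */
    const char *mark = NULL; /* text position that star currently ends at */

    while (*text != '\0') {
        if (*pat == '*') {
            star = pat++;    /* remember the star, try matching zero chars */
            mark = text;
        } else if (*pat == '?' || *pat == *text) {
            pat++;
            text++;
        } else if (star != NULL) {
            pat = star + 1;  /* backtrack: star absorbs one more char */
            text = ++mark;
        } else {
            return false;
        }
    }
    /* Text is exhausted; only trailing stars may remain in the pattern. */
    while (*pat == '*') {
        pat++;
    }
    return *pat == '\0';
}
```

Only the latest star is ever revisited, so the matcher needs no heap state and no compiled automaton, which is the property that makes this style cheap for short ignore patterns.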
Known problems
- `openat(".git")` probe regression: The inline processing refactor replaced a free dirent-name scan with a paid `openat` syscall per directory (~280K directories = 280K syscalls, most returning ENOENT). User time dropped from clone elimination, but system time rose from the probe, roughly canceling out. The old code detected `.git` for free while scanning entries; the new code needs `.git` info before processing, forcing the probe.

  Fixes to explore:
  - Skip probe in `.All` mode: gitignore context is irrelevant, so `has_git` is unused. Eliminates ~280K ENOENT probes in `-HI` case. Low effort.
  - Two-pass over first getdents batch: scan first batch for `.git`, set up context, then process all batches. `.git` virtually always appears in the first batch. Risk: not guaranteed.
  - Lazy context reset: process entries optimistically, reset context if `.git` found mid-scan. Complex, entries already processed with wrong context.
- Allocator efficiency gap: findr still allocates 1-3 heap strings per entry (`join_path` results, work item paths). fd does the same but benefits from Rust's allocator. Odin's default allocator may have higher per-allocation overhead.
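
The two-pass idea could look roughly like the following C sketch, which scans one already-read getdents64 batch for a `.git` entry before any entry is processed. The struct mirrors the Linux `getdents64` record layout; the names `dirent64_rec`, `batch_has_git` and `put_record` are hypothetical, and `put_record` exists only to build synthetic batches for experimenting without touching the filesystem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the record layout that Linux getdents64 writes into the buffer. */
struct dirent64_rec {
    uint64_t d_ino;
    int64_t  d_off;
    uint16_t d_reclen; /* total size of this record, including padding */
    uint8_t  d_type;
    char     d_name[]; /* NUL-terminated entry name */
};

/* First pass: returns true if the batch contains an entry named ".git".
 * Fields are read with memcpy so the buffer needs no special alignment. */
bool batch_has_git(const char *buf, size_t nread) {
    assert(buf != NULL || nread == 0);
    const size_t hdr = offsetof(struct dirent64_rec, d_name);
    size_t pos = 0;

    while (pos + hdr < nread) {
        uint16_t reclen;
        memcpy(&reclen, buf + pos + offsetof(struct dirent64_rec, d_reclen),
               sizeof reclen);
        if (reclen <= hdr || pos + reclen > nread) {
            break; /* malformed or truncated record: stop scanning */
        }
        /* Compare 5 bytes so the trailing NUL is part of the match. */
        if ((size_t)reclen - hdr >= 5 &&
            memcmp(buf + pos + hdr, ".git", 5) == 0) {
            return true;
        }
        pos += reclen;
    }
    return false;
}

/* Hypothetical helper for experiments: appends one synthetic record with
 * 8-byte-aligned length and returns the new end offset. */
size_t put_record(char *buf, size_t pos, const char *name) {
    const size_t hdr = offsetof(struct dirent64_rec, d_name);
    size_t len = strlen(name) + 1;
    uint16_t reclen = (uint16_t)((hdr + len + 7) & ~(size_t)7);
    memset(buf + pos, 0, reclen);
    memcpy(buf + pos + offsetof(struct dirent64_rec, d_reclen), &reclen,
           sizeof reclen);
    memcpy(buf + pos + hdr, name, len);
    return pos + reclen;
}
```

A `false` result from the first batch does not prove that `.git` is absent, since entry order is not guaranteed, so a real implementation would still need a fallback for directories that span several batches.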
Remaining ideas
- Skip `has_git_dir` probe in `.All` mode: Trivial guard. Directly addresses the system-time regression in the `-HI` case.
- Arena allocator per thread: Bump allocator for all transient strings (result paths, work item paths), free once at exit. Would address the allocator efficiency gap. Bigger change, helps everywhere.
- Batched channel (fd's approach): Replace global results array with buffered channel of batches. Enables streaming output and sorting like fd does.
Allocator analysis
Each emitted entry still needs a heap-allocated result string from join_path/join_path_dir, and each subdirectory needs a cloned child_path + child_rel for the work queue. That's 1-3 heap allocs per entry × millions of entries.
fd has the same pattern (`PathBuf` per entry + per subdirectory) but benefits from Rust's allocator (the system malloc/free by default, or jemalloc when configured). Odin's default allocator may have higher per-allocation overhead. Options:
- Arena per thread: bulk-allocate, reset after each directory or at thread exit. Best for transient data.
- Slab allocator for small strings: most filenames are <64 bytes. A slab for small allocations could reduce fragmentation and improve cache locality.
- Test with different Odin allocators: `context.allocator` can be swapped. Worth profiling with `mem.virt_allocator` or a custom arena to measure the gap.
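
As a rough picture of the arena-per-thread option, the C sketch below implements a fixed-capacity bump allocator for strings. The `Arena` type and its functions are hypothetical names for this example; a production version would likely chain additional blocks instead of failing when full, and in findr it would be expressed through Odin's allocator interface instead.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One arena per worker thread: no locking is needed because only the
 * owning thread ever allocates from it. */
typedef struct {
    char  *base; /* start of the backing block */
    size_t cap;  /* total bytes available */
    size_t used; /* bytes handed out so far */
} Arena;

/* Allocates the backing block once. Returns 0 on success, -1 on failure. */
int arena_init(Arena *a, size_t cap) {
    assert(a != NULL && cap > 0);
    a->base = malloc(cap);
    a->cap = (a->base != NULL) ? cap : 0;
    a->used = 0;
    return (a->base != NULL) ? 0 : -1;
}

/* Copies len bytes of s into the arena and NUL-terminates the copy.
 * Allocation is a pointer bump; returns NULL when the arena is full. */
char *arena_strdup(Arena *a, const char *s, size_t len) {
    if (len + 1 > a->cap - a->used) {
        return NULL;
    }
    char *out = a->base + a->used;
    memcpy(out, s, len);
    out[len] = '\0';
    a->used += len + 1;
    return out;
}

/* Releases every string at once by rewinding the bump pointer. */
void arena_reset(Arena *a) {
    a->used = 0;
}

/* Frees the backing block, for example at thread exit. */
void arena_free(Arena *a) {
    free(a->base);
    a->base = NULL;
    a->cap = 0;
    a->used = 0;
}
```

Resetting after each directory suits strings that die with the directory, while result paths that must survive until output would need an arena that is only freed at thread exit.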