Performance Ideas

Current state after the regex→glob migration, the 32KB getdents buffer, skipping gitignore loading in .All mode, and inline entry processing: findr beats fd in 3 of 4 cases.

Benchmark results (2026-06-17, post-inline-processing)

Case	fd	findr	Ratio
1 `-E .jj`	187ms	150ms	1.25x faster
2 `-H`	1.242s	1.136s	1.09x faster
3 `-HI`	1.708s	1.612s	1.06x slower
4 `-E .git`	306ms	242ms	1.26x faster

Case 3 (-HI) wall time is now close to parity. User time dropped 38% (6.9s → 4.3s) from eliminating entry name clones, but system time rose 38% (8.2s → 11.3s) from the openat(".git") probe overhead.

Completed

Per-thread result buffers — each thread accumulates locally, merges once at exit. Eliminates per-result mutex contention.
Lean path join — join_path/join_path_dir use stack buffer + copy + single alloc instead of strings.Builder + fmt.sbprintf + clone.
Regex→glob migration — replaced regex NFA with backtracking glob matcher. Eliminated 27% of CPU spent on add_thread/is_ignored. Biggest win.
32KB getdents buffer — bumped from 8KB. Marginal improvement, within noise.
Skip gitignore loading in .All mode — eliminated thousands of unnecessary file opens/parses in -HI. Cut system time 34% (12.4s → 8.2s).
Fixed-size threads slice — replaced [dynamic]^thread.Thread with []^thread.Thread since thread count is known upfront.
Inline entry processing — merged read_dir_entries into process_dir. Entry names consumed directly from getdents buffer via dirent_name(d) views. Eliminated millions of strings.clone/delete pairs. User time dropped 38% in -HI case.
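As a rough illustration of the regex→glob item above, here is one common way to write a backtracking glob matcher in C. It is a sketch under assumptions, not findr's actual matcher: the name glob_match is hypothetical, and only `*` (any run of bytes) and `?` (exactly one byte) are handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: match `name` against `pat`.
 * '*' matches any run of bytes (including none), '?' matches one byte. */
bool glob_match(const char *pat, const char *name) {
    assert(pat != NULL && name != NULL);
    const char *star = NULL;   /* most recent '*' seen in pat */
    const char *resume = NULL; /* position in name where that '*' started */

    while (*name) {
        if (*pat == '*') {
            /* Remember the star and first try matching zero bytes. */
            star = pat++;
            resume = name;
        } else if (*pat == '?' || *pat == *name) {
            pat++;
            name++;
        } else if (star) {
            /* Backtrack: let the last '*' swallow one more byte. */
            pat = star + 1;
            name = ++resume;
        } else {
            return false;
        }
    }
    /* Name is exhausted; only trailing stars may remain in the pattern. */
    while (*pat == '*') pat++;
    return *pat == '\0';
}
```

Only one backtrack point is kept (the most recent `*`), so there is no per-match allocation and no NFA thread list to maintain, which is plausibly where the saving over a regex engine comes from.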

fd vs findr architecture comparison

Aspect	fd (ignore crate)	findr
Syscall	`libc::readdir`	raw `getdents64`
Entry names	Clones into owned `PathBuf` per entry	Zero-copy view from getdents buffer
`.git` detection	`stat(".git")` per directory	`openat(fd, ".git")` probe per directory
Gitignore setup	Before entry iteration	Before entry iteration
Path traversal	Full paths	Full paths
Glob matching	globset stratification (literals→hash, complex→regex)	Backtracking token matcher
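
The "zero-copy view from getdents buffer" row can be sketched in C roughly as follows. This is Linux-only and assumes the record layout documented in getdents(2); count_entries is a hypothetical helper written for this example, not a findr function.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Record layout returned by the Linux getdents64 syscall. */
struct linux_dirent64 {
    uint64_t       d_ino;
    int64_t        d_off;
    unsigned short d_reclen;
    unsigned char  d_type;
    char           d_name[];
};

/* Hypothetical helper: count the entries of `path` (skipping "." and "..")
 * while reading every name in place. Returns -1 on error. */
long count_entries(const char *path) {
    _Alignas(uint64_t) char buf[32 * 1024];
    long count = 0;

    int fd = open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (fd < 0) return -1;

    for (;;) {
        long n = syscall(SYS_getdents64, fd, buf, sizeof buf);
        if (n < 0) { close(fd); return -1; }
        if (n == 0) break; /* end of directory */

        for (long off = 0; off < n; ) {
            struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + off);
            assert(d->d_reclen > 0);
            /* `name` points into buf; it is valid only until the next
             * getdents64 call overwrites the buffer. */
            const char *name = d->d_name;
            if (strcmp(name, ".") != 0 && strcmp(name, "..") != 0) count++;
            off += d->d_reclen;
        }
    }
    close(fd);
    return count;
}
```

The point of the sketch is the lifetime rule: a name may be used freely while its batch is being processed, and has to be copied only if it must outlive the buffer (for example when it is queued as a work item).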

Known problems

openat(".git") probe regression — The inline processing refactor replaced a free dirent-name scan with a paid openat syscall per directory (~280K directories = 280K syscalls, most returning ENOENT). User time dropped from clone elimination, but system time rose from the probe, roughly canceling out. The old code detected .git for free while scanning entries; the new code needs .git info before processing, forcing the probe.

Fixes to explore:
- Skip probe in .All mode: gitignore context is irrelevant there, so has_git is unused. Eliminates ~280K ENOENT probes in the -HI case. Low effort.
- Two-pass over first getdents batch: scan the first batch for .git, set up context, then process all batches. .git virtually always appears in the first batch. Risk: getdents entry order is not guaranteed, so .git can land in a later batch and a fallback is still needed.
- Lazy context reset: process entries optimistically and reset context if .git is found mid-scan. Complex, because entries seen before .git have already been processed with the wrong context.
Allocator efficiency gap — findr still allocates 1-3 heap strings per entry (join_path results, work item paths). fd does the same but benefits from Rust's allocator. Odin's default allocator may have higher per-allocation overhead.

Remaining ideas

Skip has_git_dir probe in .All mode: trivial guard that directly addresses the system-time regression in the -HI case.
Arena allocator per thread: bump allocator for all transient strings (result paths, work item paths), freed once at exit. Would address the allocator efficiency gap. Bigger change, but helps everywhere.
Batched channel (fd's approach): replace the global results array with a buffered channel of batches. Enables streaming output and sorting, as fd does.
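
The producer side of the batched-channel idea can be sketched in C as below. This is an assumption-laden illustration, not fd's or findr's code: the channel is replaced by a callback (batch_sink), BATCH_CAP is an arbitrary size, and all type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_CAP 4 /* illustrative size only */

typedef struct {
    const char *items[BATCH_CAP];
    size_t      len;
} Batch;

/* Receives a full (or final partial) batch. A real implementation would
 * send the batch on a channel or bounded queue here. */
typedef void (*batch_sink)(const Batch *batch, void *ctx);

typedef struct {
    Batch      cur;
    batch_sink sink;
    void      *ctx;
} Batcher;

void batcher_init(Batcher *b, batch_sink sink, void *ctx) {
    b->cur.len = 0;
    b->sink = sink;
    b->ctx = ctx;
}

/* Hand the current batch to the sink, if it holds anything. */
void batcher_flush(Batcher *b) {
    if (b->cur.len == 0) return;
    b->sink(&b->cur, b->ctx);
    b->cur.len = 0;
}

/* Append one result; the sink runs once per BATCH_CAP results. */
void batcher_push(Batcher *b, const char *path) {
    assert(b->cur.len < BATCH_CAP);
    b->cur.items[b->cur.len++] = path;
    if (b->cur.len == BATCH_CAP) batcher_flush(b);
}

/* Example sink that only counts what it receives. */
typedef struct { size_t batches, items; } SinkStats;

void count_sink(const Batch *batch, void *ctx) {
    SinkStats *s = ctx;
    s->batches++;
    s->items += batch->len;
}
```

Each worker thread would own one Batcher and call batcher_flush once more before exiting, so synchronization cost is paid per batch rather than per result.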

Allocator analysis

Each emitted entry still needs a heap-allocated result string from join_path/join_path_dir, and each subdirectory needs a cloned child_path + child_rel for the work queue. That's 1-3 heap allocs per entry × millions of entries.
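
For reference, the one-allocation-per-result shape described above looks roughly like this in C. The sketch sizes the allocation exactly instead of staging through a stack buffer, so it differs from findr's join_path in that detail; the signature is an assumption made for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Join `dir` and `name` using a single heap allocation.
 * The caller frees the result. Returns NULL if allocation fails. */
char *join_path(const char *dir, size_t dir_len,
                const char *name, size_t name_len) {
    assert(dir != NULL && name != NULL);
    /* Add a separator unless dir is empty or already ends with '/'. */
    size_t need_sep = (dir_len > 0 && dir[dir_len - 1] != '/') ? 1 : 0;
    size_t total = dir_len + need_sep + name_len;

    char *out = malloc(total + 1);
    if (!out) return NULL;

    memcpy(out, dir, dir_len);
    if (need_sep) out[dir_len] = '/';
    memcpy(out + dir_len + need_sep, name, name_len);
    out[total] = '\0';
    return out;
}
```

Even in this lean form every emitted path costs one malloc, which is why the allocator's per-call overhead matters at millions of entries.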

fd has the same pattern (a PathBuf per entry plus one per subdirectory), but its allocations go through Rust's global allocator, which is the system malloc by default or jemalloc when a build swaps it in. Odin's default allocator may have higher per-allocation overhead. Options:

Arena per thread: bulk-allocate, reset after each directory or at thread exit. Best for transient data.
Slab allocator for small strings: most filenames are <64 bytes. A slab for small allocations could reduce fragmentation and improve cache locality.
Test with different Odin allocators: context.allocator can be swapped. Worth profiling with an arena from core:mem/virtual or a custom arena to measure the gap.
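
To make the arena option concrete, here is a minimal bump-allocator sketch in C. It is an illustration under assumptions rather than a proposed implementation: the arena has a fixed capacity and returns NULL when full (a real one would chain another block), it only handles byte strings so alignment is ignored, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Fixed-capacity bump arena for byte strings. One instance per thread,
 * so no locking is needed. */
typedef struct {
    char  *base;
    size_t cap;
    size_t used;
} Arena;

/* Returns 0 on success, -1 if the backing block cannot be allocated. */
int arena_init(Arena *a, size_t cap) {
    a->base = malloc(cap);
    a->cap  = a->base ? cap : 0;
    a->used = 0;
    return a->base ? 0 : -1;
}

/* Copy `len` bytes of `s` plus a terminating NUL into the arena.
 * Returns NULL when the arena is full. */
char *arena_strdup(Arena *a, const char *s, size_t len) {
    assert(a->used <= a->cap);
    if (len + 1 > a->cap - a->used) return NULL;
    char *out = a->base + a->used;
    memcpy(out, s, len);
    out[len] = '\0';
    a->used += len + 1; /* bump the offset; nothing is freed individually */
    return out;
}

/* Drop every string at once, e.g. after a directory or at thread exit. */
void arena_reset(Arena *a) { a->used = 0; }

void arena_destroy(Arena *a) {
    free(a->base);
    a->base = NULL;
    a->cap = a->used = 0;
}
```

Allocation here is a bounds check and a pointer bump, and freeing is a single reset, which is the property that would let transient result and work-item paths avoid per-string allocator overhead.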
