Performance Ideas

Current state after regex→glob migration + 32KB getdents + skip gitignore in .All mode + inline entry processing. findr beats fd in 3/4 cases.

Benchmark results (2026-06-17, post-inline-processing)

| Case | fd | findr | Ratio (findr vs fd) |
|---|---|---|---|
| 1 -E .jj | 187ms | 150ms | 1.25x faster |
| 2 -H | 1.242s | 1.136s | 1.09x faster |
| 3 -HI | 1.708s | 1.612s | 1.06x slower |
| 4 -E .git | 306ms | 242ms | 1.26x faster |

Case 3 (-HI) wall time is now close to parity. User time dropped 38% (6.9s → 4.3s) from eliminating entry name clones, but system time rose 38% (8.2s → 11.3s) from the openat(".git") probe overhead.

Completed

  1. Per-thread result buffers — each thread accumulates locally, merges once at exit. Eliminates per-result mutex contention.
  2. Lean path join: join_path/join_path_dir use a stack buffer + copy + single alloc instead of strings.Builder + fmt.sbprintf + clone.
  3. Regex→glob migration — replaced regex NFA with backtracking glob matcher. Eliminated 27% of CPU spent on add_thread/is_ignored. Biggest win.
  4. 32KB getdents buffer — bumped from 8KB. Marginal improvement, within noise.
  5. Skip gitignore loading in .All mode — eliminated thousands of unnecessary file opens/parses in -HI. Cut system time 34% (12.4s → 8.2s).
  6. Fixed-size threads slice — replaced [dynamic]^thread.Thread with []^thread.Thread since thread count is known upfront.
  7. Inline entry processing — merged read_dir_entries into process_dir. Entry names consumed directly from getdents buffer via dirent_name(d) views. Eliminated millions of strings.clone/delete pairs. User time dropped 38% in -HI case.
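To make item 3 concrete, here is a minimal sketch of a backtracking glob matcher. It is written in C for illustration rather than in findr's Odin, and `glob_match` is a hypothetical name. It assumes only `*` and `?` wildcards; character classes, `**`, and gitignore-style `/` anchoring are left out, and it does not reflect findr's actual token representation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Iterative backtracking match for '*' (any run of bytes) and '?' (any
   single byte). On a mismatch it rewinds to the most recent '*' and lets
   that star absorb one more byte of the name. */
bool glob_match(const char *pat, const char *name)
{
    assert(pat != NULL && name != NULL);

    const char *star = NULL;   /* pattern position just after the last '*' */
    const char *retry = NULL;  /* name position where that '*' started matching */

    while (*name) {
        if (*pat == '*') {
            star = ++pat;      /* remember where to resume the pattern */
            retry = name;      /* the star initially matches nothing */
        } else if (*pat == '?' || *pat == *name) {
            pat++;
            name++;
        } else if (star) {
            pat = star;        /* backtrack: star absorbs one more byte */
            name = ++retry;
        } else {
            return false;      /* mismatch with no star to fall back on */
        }
    }
    while (*pat == '*') pat++; /* trailing stars match the empty string */
    return *pat == '\0';
}
```

Unlike an NFA simulation, this keeps no per-state thread list: the only state is two saved pointers, which is the kind of bookkeeping the migration removed from the hot path.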

fd vs findr architecture comparison

| Aspect | fd (ignore crate) | findr |
|---|---|---|
| Syscall | libc::readdir | raw getdents64 |
| Entry names | Clones into owned PathBuf per entry | Zero-copy view from getdents buffer |
| .git detection | stat(".git") per directory | openat(fd, ".git") probe per directory |
| Gitignore setup | Before entry iteration | Before entry iteration |
| Path traversal | Full paths | Full paths |
| Glob matching | globset stratification (literals→hash, complex→regex) | Backtracking token matcher |

Known problems

  1. openat(".git") probe regression — The inline processing refactor replaced a free dirent-name scan with a paid openat syscall per directory (~280K directories = 280K syscalls, most returning ENOENT). User time dropped from clone elimination, but system time rose from the probe, roughly canceling out. The old code detected .git for free while scanning entries; the new code needs .git info before processing, forcing the probe.

    Fixes to explore:

    • Skip probe in .All mode — gitignore context is irrelevant, so has_git is unused. Eliminates ~280K ENOENT probes in -HI case. Low effort.
    • Two-pass over first getdents batch — scan first batch for .git, set up context, then process all batches. .git virtually always appears in the first batch. Risk: not guaranteed.
    • Lazy context reset — process entries optimistically, reset context if .git found mid-scan. Complex, entries already processed with wrong context.
  2. Allocator efficiency gap — findr still allocates 1-3 heap strings per entry (join_path results, work item paths). fd does the same but benefits from Rust's allocator. Odin's default allocator may have higher per-allocation overhead.
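The first fix listed for the probe regression (skipping the probe in .All mode) might look roughly like the following. This is a C sketch, not findr's Odin code: `enum mode`, `needs_git_context`, and the `probe_calls` counter are hypothetical, and the `openat` flags are an assumption, since the notes only say that `openat(fd, ".git")` is used.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical modes: MODE_ALL stands for the .All mode, where no ignore
   rules apply and the probe result is never consulted. */
enum mode { MODE_RESPECT_IGNORES, MODE_ALL };

static long probe_calls;   /* counts probes issued, for illustration only */

/* One syscall per directory; fails (typically with ENOENT) when the
   directory has no ".git" entry. */
bool has_git_dir(int dirfd)
{
    probe_calls++;
    int fd = openat(dirfd, ".git", O_RDONLY | O_CLOEXEC);
    if (fd < 0) return false;
    close(fd);
    return true;
}

/* Guarded variant: skip the syscall entirely when the answer is unused. */
bool needs_git_context(enum mode m, int dirfd)
{
    if (m == MODE_ALL) return false;
    return has_git_dir(dirfd);
}
```

With the guard in place, a traversal in `MODE_ALL` issues no probes at all, so the ~280K ENOENT results in the -HI case would disappear from system time.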

Remaining ideas

  1. Skip has_git_dir probe in .All mode. Trivial guard. Directly addresses the system-time regression in the -HI case.

  2. Arena allocator per thread. Bump allocator for all transient strings (result paths, work item paths), free once at exit. Would address the allocator efficiency gap. Bigger change, helps everywhere.

  3. Batched channel (fd's approach). Replace global results array with buffered channel of batches. Enables streaming output and sorting like fd does.
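The batching half of idea 3 can be sketched independently of any channel implementation. In the C sketch below, `struct batcher`, `BATCH_CAP`, and the sink callback are all hypothetical; the sink stands in for a channel send, and the batch size of 256 is an arbitrary assumption rather than a value taken from fd or findr.

```c
#include <stddef.h>

#define BATCH_CAP 256   /* assumed batch size, not taken from fd or findr */

/* Receives a full (or final partial) batch; stands in for a channel send. */
typedef void (*sink_fn)(const char **paths, size_t n, void *user);

struct batcher {
    const char *paths[BATCH_CAP];
    size_t      len;
    sink_fn     sink;
    void       *user;
};

/* Hands off whatever has accumulated; call once more at thread exit. */
void batcher_flush(struct batcher *b)
{
    if (b->len == 0) return;
    b->sink(b->paths, b->len, b->user);
    b->len = 0;
}

/* Appends one result and hands off the batch only when it is full, so the
   cost of the handoff is paid once per BATCH_CAP results. */
void batcher_push(struct batcher *b, const char *path)
{
    b->paths[b->len++] = path;
    if (b->len == BATCH_CAP) batcher_flush(b);
}

/* Example sink: tallies how many batches and items arrived. */
struct tally { size_t batches, items; };

void tally_sink(const char **paths, size_t n, void *user)
{
    (void)paths;
    struct tally *t = user;
    t->batches++;
    t->items += n;
}
```

Each worker thread would own one `struct batcher`, which keeps the per-result cost local while still letting a consumer print or sort results before the walk finishes.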

Allocator analysis

Each emitted entry still needs a heap-allocated result string from join_path/join_path_dir, and each subdirectory needs a cloned child_path + child_rel for the work queue. That's 1-3 heap allocs per entry × millions of entries.
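For reference, the single allocation per result string looks roughly like this. It is a C sketch of the idea, not findr's Odin `join_path`: it takes explicit lengths so the name can be a non-owning view, and it skips the stack-buffer step mentioned under Completed.

```c
#include <stdlib.h>
#include <string.h>

/* Joins dir and name with exactly one heap allocation. Both inputs are
   (pointer, length) views and need not be NUL-terminated. Returns NULL if
   the allocation fails; the caller frees the result. */
char *join_path(const char *dir, size_t dir_len,
                const char *name, size_t name_len)
{
    /* Insert a separator only when dir is non-empty and lacks one. */
    size_t need_sep = (dir_len > 0 && dir[dir_len - 1] != '/') ? 1 : 0;

    char *out = malloc(dir_len + need_sep + name_len + 1);
    if (!out) return NULL;

    memcpy(out, dir, dir_len);
    if (need_sep) out[dir_len] = '/';
    memcpy(out + dir_len + need_sep, name, name_len);
    out[dir_len + need_sep + name_len] = '\0';
    return out;
}
```

Even in this lean form, every emitted entry still costs one allocator round trip, which is why per-allocation overhead matters at millions of entries.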

fd has the same pattern (PathBuf per entry + per subdirectory) but benefits from the allocator behind Rust's global allocator: the system malloc/free by default, or jemalloc where fd is built with it. Odin's default allocator may have higher per-allocation overhead. Options:

  • Arena per thread: bulk-allocate, reset after each directory or at thread exit. Best for transient data.
  • Slab allocator for small strings: most filenames are <64 bytes. A slab for small allocations could reduce fragmentation and improve cache locality.
  • Test with different Odin allocators: context.allocator can be swapped. Worth profiling with mem.virt_allocator or a custom arena to measure the gap.
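The arena option can be sketched as a fixed-capacity bump allocator. This is C for illustration, with hypothetical names (`struct arena`, `arena_alloc`, `arena_strdup`); a real per-thread arena in findr would presumably grow on demand and plug into Odin's `context.allocator`, which this sketch does not attempt.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct arena {
    char  *base;
    size_t cap;
    size_t used;
};

/* Reserves one backing block; returns 0 on success, -1 on failure. */
int arena_init(struct arena *a, size_t cap)
{
    a->base = malloc(cap);
    a->cap = cap;
    a->used = 0;
    return a->base ? 0 : -1;
}

/* Bump allocation: round the cursor up to 16 bytes and advance it. There
   is no per-object header and no individual free. Returns NULL when the
   fixed capacity of this sketch is exhausted. */
void *arena_alloc(struct arena *a, size_t n)
{
    size_t aligned = (a->used + 15) & ~(size_t)15;
    if (aligned > a->cap || n > a->cap - aligned) return NULL;
    a->used = aligned + n;
    return a->base + aligned;
}

/* Copies a (pointer, length) view into the arena and NUL-terminates it. */
char *arena_strdup(struct arena *a, const char *s, size_t len)
{
    char *p = arena_alloc(a, len + 1);
    if (!p) return NULL;
    memcpy(p, s, len);
    p[len] = '\0';
    return p;
}

/* Frees every allocation at once by rewinding the cursor. */
void arena_reset(struct arena *a) { a->used = 0; }

void arena_destroy(struct arena *a)
{
    free(a->base);
    a->base = NULL;
    a->cap = a->used = 0;
}
```

Resetting after each directory suits scratch strings, while result paths that must outlive the walk would need an arena that is only destroyed at thread exit.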