diff --git a/PERFORMANCE_IDEAS.md b/PERFORMANCE_IDEAS.md
index 96acbd6..5290d27 100644
--- a/PERFORMANCE_IDEAS.md
+++ b/PERFORMANCE_IDEAS.md
@@ -1,17 +1,17 @@
 # Performance Ideas
 
-Current state after regex→glob migration + inline entry processing + skip gitignore in .All mode + channel-based streaming output. findr beats fd in 3/4 cases.
+Current state after regex→glob migration + inline entry processing + skip gitignore in .All mode + channel-based streaming output + byte-buffer output. findr beats fd in 4/4 cases.
 
-## Benchmark results (2026-06-17, post-channels)
+## Benchmark results (2026-06-17, post-byte-buffer)
 
 | Case | fd | findr | Ratio |
 |------|------|-------|-------|
-| 1 `-E .jj` | 159ms | 112ms | **1.42x faster** |
-| 2 `-H` | 1.202s | 710ms | **1.69x faster** |
-| 3 `-HI` | 1.080s | 1.212s | **1.12x slower** |
-| 4 `-E .git` | 298ms | 222ms | **1.34x faster** |
+| 1 `-E .jj` | 148ms | 99ms | **1.50x faster** |
+| 2 `-H` | 1.142s | 609ms | **1.88x faster** |
+| 3 `-HI` | 1.009s | 966ms | **1.04x faster** |
+| 4 `-E .git` | 268ms | 197ms | **1.36x faster** |
 
-Channels gave the biggest single improvement since the project started. Cases 1, 2, and 4 got dramatically faster because output I/O now overlaps with directory walking. Case 3 improved from 1.18x slower to 1.12x slower.
+Byte-buffer output eliminated per-result string allocations. Workers now write `path\n` directly into `[]u8` buffers sent through the channel; the output writer does a single bulk write per batch. Case 3 (`-HI`, 5.6M entries) flipped from 1.12x slower to 1.04x faster — the biggest win since it has the most output.
 
 ## Completed
 
@@ -24,6 +24,7 @@ Channels gave the biggest single improvement since the project started. Cases 1,
 7. **Inline entry processing** — merged `read_dir_entries` into `process_dir`. Entry names consumed directly from getdents buffer via `dirent_name(d)` views. Eliminated millions of `strings.clone`/`delete` pairs. User time dropped 38% in `-HI` case.
 8. **Skip `has_git_dir` probe in `.All` mode** — guarded `has_git_dir(fd)` with `ignore_mode != .All`. Eliminated ~280K wasted `openat` ENOENT probes in `-HI` case. System time dropped 33% (11.3s → 7.6s).
 9. **Channel-based streaming output** — replaced global results array + mutex with `chan.Chan([]string)`, cap `2 * thread_count`. Workers flush 256-result batches through the channel; a consumer thread drains to stdout. Matches fd's architecture (`crossbeam_channel::bounded(2*threads)`, batch size `0x100`). Eliminates the collect-then-write barrier. Cases 1/2/4 went from 1.1-1.3x faster to 1.3-1.7x faster.
+10. **Byte-buffer output** — replaced `chan.Chan([]string)` with `chan.Chan([]u8)`. Workers write `path\n` directly into 64KB byte buffers via `append_path`; output writer does a single bulk `writer_write` per batch. Eliminates ~5M `join_path` allocs, ~5M `delete(s)` frees, ~20K batch array allocs. Case 3 (`-HI`) flipped from 1.12x slower to 1.04x faster. All 4 cases now beat fd.
 
 ## fd vs findr architecture comparison
 
@@ -36,8 +37,8 @@ Channels gave the biggest single improvement since the project started. Cases 1,
 | Path traversal | Full paths | Full paths |
 | Glob matching | globset stratification (literals→hash, complex→regex) | Backtracking token matcher |
 | Result transport | `crossbeam_channel::bounded(2*threads)` (lock-free MPMC) | `core:sync/chan` (single-mutex ring buffer) |
-| Batching | `Arc>>` shared buffer, flush on first item | Detach backing array as `[]string`, flush when full (256) |
-| Output mode | Hybrid: buffer 1000 items / 100ms → sort → stream | Direct streaming (no buffer/sort mode yet) |
+| Batching | `Arc>>` shared buffer, flush on first item | 64KB `[]u8` byte buffers, flush when full |
+| Output mode | Hybrid: buffer 1000 items / 100ms → sort → stream | Bulk byte writes, direct streaming (no buffer/sort mode yet) |
 
 ## Known problems
 
@@ -47,11 +48,51 @@ Channels gave the biggest single improvement since the project started. Cases 1,
 
 ## Remaining ideas
 
-1. **Lock-free MPMC queue**
+### Allocation strategies
+
+Allocation audit (per-entry hot path in `process_dir`):
+
+| Site | What | Est. count (-HI) |
+|------|------|-------------------|
+| `join_path`/`join_path_dir` for results | `make([]u8, total)` for result paths | ~5M |
+| `join_path` for WorkItem paths | same, for recursed dirs | ~500K |
+| `strings.clone(entry_rel)` | clone for WorkItem.rel | ~500K |
+| `clone_to_cstring(dir_path)` | cstring for `open()` | ~500K |
+| `flush_batch` → `make([dynamic]string)` | new batch array | ~20K |
+| `delete(s)` per result | free in output writer | ~5M |
+
+Available Odin allocators: `core:mem` (Arena, Dynamic_Arena, Stack, etc.), `core:mem/tlsf` (TLSF — O(1) alloc/free, supports individual frees, grows via backing allocator).
+
+1. **Byte-buffer output — eliminate result path allocations entirely** *(COMPLETED — see #10 in Completed)*
+
+2. **Stack-buffer cstring for `open()`**
+   Replace `strings.clone_to_cstring(dir_path)` + `delete(cpath)` with a stack buffer copy:
+   ```odin
+   cbuf: [4096]u8
+   copy(cbuf[:], dir_path)
+   cbuf[len(dir_path)] = 0
+   fd, err := linux.open(cstring(raw_data(cbuf[:])), ...)
+   ```
+
+   **Eliminates**: ~500K heap allocs for cstrings. Trivial change.
+
+3. **Arena for WorkItem paths**
+   Use a `Dynamic_Arena` or virtual-memory bump allocator for `join_path` results and `clone(entry_rel)` in WorkItems. Remove individual `delete(item.path)` / `delete(item.rel)` calls. Free arena once at end of `walk_stream`.
+
+   **Eliminates**: ~1M individual alloc/free pairs for WorkItem paths/rels.
+
+   **Challenge**: WorkItems cross thread boundaries via the queue, so the arena must be shared. A shared `Dynamic_Arena` needs synchronization on the bump pointer. Cleanest approach: `core:mem/virtual` to reserve a large address space (e.g. 256MB) and do `atomic_add_explicit(&offset, size, .Acquire)` for lock-free bump allocation.
+
+4. **TLSF as global allocator**
+   Swap `context.allocator` to TLSF at program start. O(1) alloc/free with good cache locality. ~5 lines of code. Best as a fallback if strategies 1-3 don't fully close the gap.
+
+### Other ideas
+
+5. **Lock-free MPMC queue**
    Replace Odin's mutex-based channel with a custom multi-producer-single-consumer ring buffer using atomics. Eliminates all futex syscalls on the result-transport hot path.
 
    **Design**:
-   - Fixed-capacity ring buffer of `[]string` slots (cap = `2 * thread_count`, same as now)
+   - Fixed-capacity ring buffer of `[]u8` slots (cap = `2 * thread_count`, same as now)
    - Producer side: each worker atomic-CASes a `head` counter forward to claim a slot index, writes its batch, then sets a `ready` flag on the slot
    - Consumer side: atomic-load `head`, drains all ready slots up to `head`, writes to stdout, frees batches
    - Backpressure: if `head - tail >= cap`, producer spins/waits (yields via `sched_yield` or `futex` with private flag)
@@ -63,11 +104,8 @@ Channels gave the biggest single improvement since the project started. Cases 1,
 
    **Effort**: Medium. ~100-150 lines for the queue + a few tests. No changes to walker or main.
 
-2. **Arena allocator per thread**
-   Bump allocator for all transient strings (result paths, work item paths), free once at exit. Would address the allocator efficiency gap. Bigger change, helps everywhere.
-
-3. **Buffer/sort output mode** (fd's approach)
+6. **Buffer/sort output mode** (fd's approach)
    Buffer up to 1000 results (or 100ms deadline), sort them, then switch to streaming. Gives sorted output for small searches without sacrificing throughput on large ones. fd's `ReceiverMode::Buffering → Streaming` pattern.
 
-4. **Git index parsing**
+7. **Git index parsing**
    Parse `.git/index` binary format to show tracked dotfiles. Closes the 84-file correctness delta in cases 1/4. Last correctness gap.
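
Review note on idea 3 (arena for WorkItem paths): the lock-free bump allocation described under **Challenge** could look roughly like the sketch below. This is a minimal illustration, not code from this change. `Bump_Arena` and `bump_alloc` are hypothetical names, a plain heap slice stands in for the `core:mem/virtual` reservation, and the sketch assumes relaxed ordering is enough for claiming a range because the work queue mutex already publishes the written bytes to other threads.

```odin
package main

import "core:fmt"
import "core:sync"

// Hypothetical shared bump arena; names are illustrative only.
Bump_Arena :: struct {
	data:   []u8, // backing memory, reserved once up front
	offset: int,  // next free byte; only touched through atomics
}

// Claims `size` bytes by atomically advancing the offset.
// Returns nil when the arena is exhausted.
bump_alloc :: proc(a: ^Bump_Arena, size: int) -> []u8 {
	// atomic_add_explicit returns the value before the add, so each
	// caller receives a disjoint [start, start+size) range.
	start := sync.atomic_add_explicit(&a.offset, size, .Relaxed)
	if start + size > len(a.data) {
		return nil
	}
	return a.data[start:start + size]
}

main :: proc() {
	backing := make([]u8, 64)
	defer delete(backing)
	arena := Bump_Arena{data = backing}

	p1 := bump_alloc(&arena, 16)
	p2 := bump_alloc(&arena, 40)
	p3 := bump_alloc(&arena, 16) // 16 + 40 + 16 exceeds 64, so this fails
	fmt.println(len(p1), len(p2), p3 == nil)
}
```

A failed claim still advances `offset`, so a real version would need to either size the reservation generously or fall back to the heap once the arena is full.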
diff --git a/findr.odin b/findr.odin
index 3d8a7b6..02f9f8f 100644
--- a/findr.odin
+++ b/findr.odin
@@ -7,7 +7,7 @@ import "core:sync/chan"
 import "core:thread"
 
 Writer_Data :: struct {
-	ch: chan.Chan([]string),
+	ch: chan.Chan([]u8),
 }
 
 output_writer :: proc(t: ^thread.Thread) {
@@ -18,13 +18,8 @@ output_writer :: proc(t: ^thread.Thread) {
 	defer bufio.writer_destroy(&w)
 
 	for {
-		batch, ok := chan.recv(data.ch)
-		if !ok do break
-		for s in batch {
-			bufio.writer_write_string(&w, s)
-			bufio.writer_write_byte(&w, '\n')
-			delete(s)
-		}
+		batch := chan.recv(data.ch) or_break
+		bufio.writer_write(&w, batch)
 		delete(batch)
 	}
 	bufio.writer_flush(&w)
@@ -97,7 +92,7 @@ main :: proc() {
 
 	thread_count := os.get_processor_core_count()
 
-	ch, _ := chan.create(chan.Chan([]string), max(2 * thread_count, 2), context.allocator)
+	ch, _ := chan.create(chan.Chan([]u8), max(2 * thread_count, 2), context.allocator)
 	defer chan.destroy(ch)
 
 	wdata := new(Writer_Data)
diff --git a/walker.odin b/walker.odin
index 662d02b..6fb6b41 100644
--- a/walker.odin
+++ b/walker.odin
@@ -9,7 +9,7 @@ import "core:sys/linux"
 import "core:text/regex"
 import "core:thread"
 
-BATCH_SIZE :: 256
+OUTPUT_BUF_SIZE :: 64 * 1024
 
 IgnoreMode :: enum {
 	Respected, // skip gitignored, prune ignored dirs (fd -H default)
@@ -38,31 +38,49 @@ WorkItem :: struct {
 }
 
 WalkerPool :: struct {
-	queue:        [dynamic]WorkItem,
-	queue_mutex:  sync.Mutex,
-	queue_sema:   sync.Atomic_Sema,
-	result_chan:  chan.Chan([]string),
-	active:       i64,
-	done:         sync.One_Shot_Event,
-	threads:      []^thread.Thread,
-	opts:         WalkOptions,
-	pattern_re:   regex.Regular_Expression,
-	has_pattern:  bool,
-	exclude_gi:   ^Gitignore,
-	all_contexts: [dynamic]^GIContext,
+	queue:         [dynamic]WorkItem,
+	queue_mutex:   sync.Mutex,
+	queue_sema:    sync.Atomic_Sema,
+	result_chan:   chan.Chan([]u8),
+	active:        i64,
+	done:          sync.One_Shot_Event,
+	threads:       []^thread.Thread,
+	opts:          WalkOptions,
+	pattern_re:    regex.Regular_Expression,
+	has_pattern:   bool,
+	exclude_gi:    ^Gitignore,
+	all_contexts:  [dynamic]^GIContext,
 	contexts_lock: sync.Mutex,
 }
 
-flush_batch :: proc(ch: chan.Chan([]string), local: ^[dynamic]string) {
+flush_buf :: proc(ch: chan.Chan([]u8), local: ^[dynamic]u8) {
 	if len(local) == 0 do return
 	batch := local[:]
-	local^ = make([dynamic]string, 0, BATCH_SIZE)
+	local^ = make([dynamic]u8, 0, OUTPUT_BUF_SIZE)
 	chan.send(ch, batch)
 }
 
+append_path :: proc(buf: ^[dynamic]u8, parent, name: string, trailing_slash: bool) {
+	need_sep := len(parent) > 0 && parent[len(parent) - 1] != '/'
+	size := len(parent) + len(name) + 1
+	if need_sep do size += 1
+	if trailing_slash do size += 1
+
+	old_len := len(buf)
+	reserve(buf, old_len + size)
+	resize(buf, old_len + size)
+
+	pos := old_len
+	pos += copy(buf[pos:], parent)
+	if need_sep {buf[pos] = '/'; pos += 1}
+	pos += copy(buf[pos:], name)
+	if trailing_slash {buf[pos] = '/'; pos += 1}
+	buf[pos] = '\n'
+}
+
 walk_stream :: proc(
 	roots: []string,
-	result_chan: chan.Chan([]string),
+	result_chan: chan.Chan([]u8),
 	opts: WalkOptions,
 	thread_count: int,
 ) {
@@ -152,7 +170,7 @@ walk_stream :: proc(
 }
 
 Collector_Data :: struct {
-	ch:      chan.Chan([]string),
+	ch:      chan.Chan([]u8),
 	results: ^[dynamic]string,
 }
 
@@ -161,8 +179,15 @@ collect_worker :: proc(t: ^thread.Thread) {
 	for {
 		batch, ok := chan.recv(data.ch)
 		if !ok do break
-		for s in batch {
-			append(data.results, s)
+		start := 0
+		for i in 0 ..< len(batch) {
+			if batch[i] == '\n' {
+				if i > start {
+					s, _ := strings.clone(string(batch[start:i]))
+					append(data.results, s)
+				}
+				start = i + 1
+			}
 		}
 		delete(batch)
 	}
@@ -171,7 +196,7 @@ collect_worker :: proc(t: ^thread.Thread) {
 walk :: proc(roots: []string, results: ^[dynamic]string, opts: WalkOptions, thread_count: int) {
 	if len(roots) == 0 do return
 
-	ch, _ := chan.create(chan.Chan([]string), max(2 * thread_count, 2), context.allocator)
+	ch, _ := chan.create(chan.Chan([]u8), max(2 * thread_count, 2), context.allocator)
 	defer chan.destroy(ch)
 
 	data := new(Collector_Data)
@@ -197,12 +222,12 @@ walk_worker :: proc(t: ^thread.Thread) {
 	prof_thread_init("walker")
 	defer prof_thread_destroy()
 
-	local_results := make([dynamic]string, 0, BATCH_SIZE)
+	local_buf := make([dynamic]u8, 0, OUTPUT_BUF_SIZE)
 	defer {
-		if len(local_results) > 0 {
-			flush_batch(pool.result_chan, &local_results)
+		if len(local_buf) > 0 {
+			flush_buf(pool.result_chan, &local_buf)
 		}
-		delete(local_results)
+		delete(local_buf)
 	}
 
 	for {
@@ -221,12 +246,12 @@ walk_worker :: proc(t: ^thread.Thread) {
 		ordered_remove(&pool.queue, last)
 		sync.mutex_unlock(&pool.queue_mutex)
 
-		process_dir(pool, item, &local_results)
+		process_dir(pool, item, &local_buf)
 		delete(item.path)
 		if len(item.rel) > 0 {delete(item.rel)}
 
-		if len(local_results) >= BATCH_SIZE {
-			flush_batch(pool.result_chan, &local_results)
+		if len(local_buf) >= OUTPUT_BUF_SIZE {
+			flush_buf(pool.result_chan, &local_buf)
 		}
 
 		old := sync.atomic_sub_explicit(&pool.active, 1, .Release)
@@ -236,7 +261,7 @@ walk_worker :: proc(t: ^thread.Thread) {
 	}
 }
 
-process_dir :: proc(pool: ^WalkerPool, item: WorkItem, local_results: ^[dynamic]string) {
+process_dir :: proc(pool: ^WalkerPool, item: WorkItem, local_buf: ^[dynamic]u8) {
 	dir_path := item.path
 
 	cpath := strings.clone_to_cstring(dir_path)
@@ -281,7 +306,7 @@ process_dir :: proc(pool: ^WalkerPool, item: WorkItem, local_results: ^[dynamic]
 		gi_ctx = new_ctx
 	}
 
-	buf: [32768]u8
+	buf: [32 * 1024]u8
 	rel_buf: [4096]u8
 
 	for {
@@ -321,8 +346,7 @@ process_dir :: proc(pool: ^WalkerPool, item: WorkItem, local_results: ^[dynamic]
 
 			if is_dir {
 				if should_emit && matches_pattern(pool, name) {
-					dir_path_out := join_path_dir(dir_path, name)
-					append(local_results, dir_path_out)
+					append_path(local_buf, dir_path, name, true)
 				}
 				if !ignored {
 					child_rel, _ := strings.clone(entry_rel)
@@ -339,8 +363,7 @@ process_dir :: proc(pool: ^WalkerPool, item: WorkItem, local_results: ^[dynamic]
 				}
 			}
 		} else if is_nondir {
 			if should_emit && matches_pattern(pool, name) {
-				full_path := join_path(dir_path, name)
-				append(local_results, full_path)
+				append_path(local_buf, dir_path, name, false)
 			}
 		}
 	}
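
Review note: the byte-buffer format this change introduces (one `path\n` record after another in a single `[dynamic]u8`) can be exercised in isolation with a small round trip like the one below. It is a sketch under assumptions rather than code from the patch. `push_line` and `count_lines` are hypothetical helpers that mirror the resize-then-copy idiom of `append_path` and the newline scan of `collect_worker`.

```odin
package main

import "core:fmt"

// Hypothetical helper: appends `s` plus a trailing newline to `buf`,
// growing the buffer once and copying in place (no per-path string).
push_line :: proc(buf: ^[dynamic]u8, s: string) {
	old_len := len(buf)
	resize(buf, old_len + len(s) + 1)
	copy(buf[old_len:], s)
	buf[old_len + len(s)] = '\n'
}

// Hypothetical helper: counts non-empty newline-terminated records,
// using the same scan as the collector in this change.
count_lines :: proc(batch: []u8) -> int {
	n, start := 0, 0
	for b, i in batch {
		if b == '\n' {
			if i > start do n += 1
			start = i + 1
		}
	}
	return n
}

main :: proc() {
	buf := make([dynamic]u8, 0, 64)
	defer delete(buf)

	push_line(&buf, "src/main.odin")
	push_line(&buf, "src/walker.odin")

	// The whole batch can be written or scanned as one contiguous slice.
	fmt.print(string(buf[:]))
	fmt.println(count_lines(buf[:]))
}
```

One consequence of this framing is that a path containing a literal newline is split into two records by any consumer that scans for `\n`, which is the same behaviour as the `collect_worker` loop in the diff.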