feat: Replaced fd with custom internals.

2026-06-16 20:57:38 -04:00
parent 159ff91938
commit 1fc5f8280e
10 changed files with 1141 additions and 2 deletions

findr/PLAN.md Normal file

@@ -0,0 +1,177 @@
# findr — Gitignored File Finder
## Overview
findr is a native Odin tool that finds **gitignored files** within git repositories. It replaces envr's current approach of running `fd` twice (all files vs. unignored files) and diffing the results.
**Simplified scope:** findr does one thing — walks directories, finds git repos, reads each repo's `.gitignore`, and prints every gitignored file. No flags, no filtering, no pattern matching. envr handles result filtering itself.
## Current fd Usage in envr (being replaced)
1. **`scan.odin:13-43`** (`scan_path`) runs `fd` twice per search path:
   - Run 1: `fd -a <matcher> [-E <exclude>]... -HI <path>` → all files, including hidden and gitignored ones
   - Run 2: `fd -a <matcher> [-E <exclude>]... -H <path>` → hidden files included, gitignored files excluded
   - The difference between the two runs is the set of gitignored files.
2. Both runs go through `run_fd` (`scan.odin:68-118`), which spawns a subprocess and captures output via temp files.
After findr integration, `scan_path` calls `findr.walk(path)` directly — no subprocess, no double-run, no diff.
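
The double-run-and-diff approach amounts to a set difference. A minimal Python sketch of that idea, where the function name and the sample paths are hypothetical (envr itself is written in Odin and works on real `fd` output):

```python
def gitignored_by_diff(all_files, unignored_files):
    """Return paths seen in the -HI run but missing from the -H run, sorted."""
    return sorted(set(all_files) - set(unignored_files))


# Illustrative inputs standing in for the output of the two fd runs.
run_hi = ["/repo/.env", "/repo/main.odin", "/repo/build/out.log"]
run_h = ["/repo/main.odin"]
print(gitignored_by_diff(run_hi, run_h))
```

Every path is produced twice and compared, which is the overhead the direct `findr.walk(path)` call is meant to remove.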
## Directory Structure
```
findr/
    findr.odin           # main + CLI (positional dir args only)
    walker.odin          # recursive directory walker using core:sys/linux getdents
    gitignore.odin       # .gitignore parsing + glob→regex transpilation + matching
    test_env.odin        # test harness: temp dir, mock filesystem, assert helpers
    findr_test.odin      # integration tests (10 tests)
    gitignore_test.odin  # transpilation + matching unit tests (22 tests)
```
## Decisions
- **Scope**: findr prints ALL gitignored files. No regex filtering, no exclude patterns, no type filters. envr post-processes the output.
- **Gitignore matching**: Transpile gitignore glob patterns to regex, then use `core:text/regex`. No dedicated glob matcher.
- **Stat avoidance**: Use `core:sys/linux` getdents directly — read `dirent.type` from the kernel, never call stat.
- **Architecture**: Separate directory with its own `main`. Core logic (`walk` proc + `gitignore` package) designed to be importable into envr later.
## CLI Interface
```
findr [dir1] [dir2] ...
```
No flags. Defaults to `.` when no directories are given. Each result is printed to stdout on its own line; paths are absolute or relative according to how the search directory was given.
## Build
```bash
odin build findr -o:speed -out:findr/findr
```
## How It Works
```
walk(dir):
    entries = getdents(dir)              # via core:sys/linux, zero stat calls
    if entries contains ".git/":
        gi = parse(.gitignore)           # if present
        for entry in entries:
            if entry is gitignored file:
                emit entry path
            if entry is dir (not ignored):
                walk(entry)              # recurse to find nested repos
    else:
        for entry in entries:
            if entry is dir:
                walk(entry)              # descend looking for repos
```
Key behaviors:
- **Nested repos**: When a repo is found, subdirectories are still traversed to find nested repos. Gitignored directories are pruned (not descended into).
- **Flat gitignore**: Only the `.gitignore` at the repo root is read. `.gitignore` files in subdirectories of a repo are not read.
- **Non-repo dirs**: Traversed recursively to find repos. No gitignore rules apply.
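
The pseudocode and behaviors above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Odin walker: `os.scandir` stands in for getdents (it usually answers `is_dir()` from the directory entry without a stat call), matching is a simplified shell-style match on the entry name rather than the regex transpiler, and skipping `.git` itself is an assumption. It follows the pseudocode literally, so a repo's rules are applied to the entries of the repo root only; the plan does not state whether the real walker carries rules into subdirectories. The names `load_rules`, `is_ignored`, and the `out` parameter are hypothetical.

```python
import fnmatch
import os


def load_rules(repo_dir):
    """Read repo_dir/.gitignore into (pattern, negated) pairs; [] if absent."""
    rules = []
    try:
        with open(os.path.join(repo_dir, ".gitignore")) as fh:
            for raw in fh:
                line = raw.rstrip()
                if not line or line.startswith("#"):
                    continue  # blank lines and comments carry no rule
                negated = line.startswith("!")
                rules.append((line[1:] if negated else line, negated))
    except FileNotFoundError:
        pass
    return rules


def is_ignored(name, is_dir, rules):
    """Last matching rule wins; a trailing '/' limits a pattern to directories."""
    ignored = False
    for pattern, negated in rules:
        if pattern.endswith("/"):
            if not is_dir:
                continue
            pattern = pattern[:-1]
        # Simplification: match the entry name only, not the full relative path.
        if fnmatch.fnmatchcase(name, pattern.lstrip("/")):
            ignored = not negated
    return ignored


def walk(directory, out):
    """Append gitignored files found under `directory` to `out` and return it."""
    with os.scandir(directory) as it:
        entries = sorted(it, key=lambda e: e.name)
    is_repo = any(
        e.name == ".git" and e.is_dir(follow_symlinks=False) for e in entries
    )
    rules = load_rules(directory) if is_repo else []  # non-repo dirs: no rules
    for entry in entries:
        is_dir = entry.is_dir(follow_symlinks=False)
        if is_dir and entry.name == ".git":
            continue  # assumption: never descend into .git itself
        ignored = is_ignored(entry.name, is_dir, rules)
        if is_dir:
            if not ignored:  # ignored directories are pruned
                walk(entry.path, out)
        elif ignored:
            out.append(entry.path)
    return out
```

Calling `walk(".", [])` returns the gitignored files below the current directory, in traversal order.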
## Performance Architecture
### Implemented
- **Stat avoidance via `dirent.type`** — Uses `core:sys/linux` getdents directly, bypassing `core:os` which calls `openat` + `fstat` per entry. File type comes free from the directory entry.
- **Prune ignored directories** — When a directory matches a gitignore pattern, it is not descended into. Skips potentially thousands of readdir calls.
### Future (if needed)
- Work-stealing parallel traversal (per-thread LIFO deques with batch stealing, like fd)
- Buffered stdout writer (in the spirit of Rust's `BufWriter`) for large result sets
- Arena allocators for path strings
## Testing Strategy
- **In-process integration tests** — Tests call `walk()` directly (not via subprocess), build mock filesystems in temp dirs, and compare sorted output.
- **Unit tests** — Pure-function tests for glob→regex transpilation and gitignore matching.
- **Output sorting for determinism** — Always sort output lines before comparison.
- **Memory tracking** — Odin's test runner reports leaks automatically. All 32 tests pass with zero leaks.
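
The harness idea (build a mock filesystem in a temp dir, then compare sorted output) can be illustrated with a short Python sketch. The helper names `make_tree` and `assert_same_paths` are hypothetical and are not taken from `test_env.odin`; the directory listing in the example is only a stand-in for a `walk()` call.

```python
import os
import tempfile


def make_tree(root, files):
    """Create each relative path under root.

    A key ending in '/' becomes a directory; any other key becomes a file
    holding the mapped text.
    """
    for rel, content in files.items():
        path = os.path.join(root, rel)
        if rel.endswith("/"):
            os.makedirs(path, exist_ok=True)
        else:
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "w") as fh:
                fh.write(content)


def assert_same_paths(actual, expected):
    """Compare two path lists after sorting, so traversal order is irrelevant."""
    assert sorted(actual) == sorted(expected), (sorted(actual), sorted(expected))


with tempfile.TemporaryDirectory() as root:
    make_tree(root, {
        "repo/.git/": None,
        "repo/.gitignore": ".env\n",
        "repo/.env": "SECRET=1\n",
    })
    # Stand-in for a walk() call: directory listings come back in arbitrary order.
    listing = os.listdir(os.path.join(root, "repo"))
    assert_same_paths(listing, [".gitignore", ".git", ".env"])
```

Sorting both sides keeps the comparison stable even when the filesystem returns entries in a different order between runs.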
### Test Coverage (findr_test.odin)
| Test | What it covers |
|---|---|
| `test_basic_gitignored` | Repo with `.gitignore`, gitignored files emitted, normal files skipped |
| `test_non_repo_not_scanned` | Dirs without `.git/` produce no output |
| `test_negation_pattern` | `!prod.env` un-ignores a file |
| `test_dir_only_pattern` | `node_modules/` pattern doesn't emit file results |
| `test_multiple_repos` | Multiple repos in one tree, each with its own `.gitignore` |
| `test_nested_repos` | Repo inside a repo, both scanned independently |
| `test_gitignore_in_subdir_ignored` | Subdirectory `.gitignore` files are not read |
| `test_no_gitignore_file` | Repo with `.git/` but no `.gitignore` produces nothing |
| `test_empty_gitignore` | Comments and blank lines only → no results |
| `test_multiple_search_dirs` | Multiple top-level search dirs in one call |
### Gitignore Unit Tests (gitignore_test.odin)
22 tests covering: simple/anchored patterns, `*`, `?`, `[abc]`, `[!abc]`, dot escaping, globstar variants, backslash escapes, empty patterns, basic matching, negation, dir-only, comments, blank lines, last-match-wins, env patterns.
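
The matching behaviors named above (comments, blank lines, negation, last-match-wins) fit in a few lines. The following Python sketch is an illustration, not the Odin `gitignore` package: `parse_lines` and `is_ignored` are hypothetical names, and the two regexes are written by hand in the style of the transpilation table.

```python
import re


def parse_lines(text):
    """Yield (pattern, negated) per rule line, skipping blanks and '#' comments."""
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("!"):
            yield line[1:], True
        else:
            yield line, False


def is_ignored(path, rules):
    """rules: (compiled_regex, negated) pairs in file order; the last match decides."""
    ignored = False
    for regex, negated in rules:
        if regex.search(path):
            ignored = not negated
    return ignored


# Hand-written regexes for the patterns `*.env` and `!prod.env`.
rules = [
    (re.compile(r"(^|/)[^/]*\.env$"), False),
    (re.compile(r"(^|/)prod\.env(/.*)?$"), True),
]

print(is_ignored("cfg/dev.env", rules))
print(is_ignored("prod.env", rules))
```

Because rules are evaluated in file order and the last match decides, placing `!prod.env` after `*.env` un-ignores that one file while every other `.env` file stays ignored.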
## Glob→Regex Transpilation Rules
| Gitignore pattern | Regex | Notes |
|---|---|---|
| `foo` | `(^\|/)foo(/.*)?$` | matches at any depth |
| `/foo` | `^foo(/.*)?$` | anchored to gitignore dir |
| `foo/` | `(^\|/)foo/.*$` | directory only |
| `*.log` | `(^\|/)[^/]*\.log$` | `*` = any chars except `/` |
| `**/foo` | `(^\|/)(.*/)?foo(/.*)?$` | `**` = any chars including `/` |
| `foo/**/bar` | `(^\|/)foo/(.*/)?bar(/.*)?$` | `**` between segments |
| `!pattern` | (handled by layer) | negation flag, not regex |
| `#comment` | (skipped) | |
| `[abc]` | `[abc]` | same regex syntax |
| `?` | `[^/]` | single char, no `/` |
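
The rules in the table can be sketched as a small character-by-character translator. This Python version is an illustration under assumptions, not the Odin transpiler: the function name `transpile` is hypothetical, negation and comments are expected to be handled before it is called, and it appends the optional `(/.*)?` tail uniformly, so `*.log` comes out as `(^|/)[^/]*\.log(/.*)?$` rather than the shorter form in the table.

```python
import re


def transpile(pattern):
    """Translate one gitignore glob (without a leading '!') into a regex string."""
    anchored = pattern.startswith("/")
    if anchored:
        pattern = pattern[1:]
    dir_only = pattern.endswith("/")
    if dir_only:
        pattern = pattern[:-1]

    out = []
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if pattern.startswith("**/", i):
            out.append("(.*/)?")  # zero or more whole directory segments
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")  # trailing globstar: anything, including '/'
            i += 2
        elif c == "*":
            out.append("[^/]*")
            i += 1
        elif c == "?":
            out.append("[^/]")
            i += 1
        elif c == "[":
            end = pattern.find("]", i + 1)
            if end == -1:
                out.append(r"\[")  # unterminated class: treat '[' literally
                i += 1
            else:
                body = pattern[i + 1:end]
                if body.startswith("!"):
                    body = "^" + body[1:]  # glob negation becomes regex negation
                out.append("[" + body + "]")
                i = end + 1
        elif c == "\\" and i + 1 < len(pattern):
            out.append(re.escape(pattern[i + 1]))  # backslash escapes the next char
            i += 2
        else:
            out.append(re.escape(c))  # literal, with regex metachars escaped
            i += 1

    prefix = "^" if anchored else "(^|/)"
    suffix = "/.*$" if dir_only else "(/.*)?$"
    return prefix + "".join(out) + suffix


for glob in ["foo", "/foo", "foo/", "**/foo", "foo/**/bar"]:
    print(glob, "->", transpile(glob))
```

For these five patterns the sketch reproduces the corresponding table rows. Git itself treats any pattern with a non-trailing `/` as relative to the `.gitignore` directory, so a stricter transpiler would emit `^foo/(.*/)?bar(/.*)?$` for `foo/**/bar`.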
## Implementation Phases
### Phase 1: Gitignore Transpiler + Tests ✅
**Goal:** Isolated, fully-tested glob→regex transpiler.
**Result:** 22 tests, all passing, zero leaks.
---
### Phase 2: findr Walker + Tests ✅
**Goal:** Working tool that finds gitignored files in git repos.
**Built:**
- `walker.odin` — Single-threaded DFS using `core:sys/linux` getdents. Finds repos, reads `.gitignore`, emits gitignored files, recurses into subdirs for nested repos.
- `findr.odin` — Minimal CLI: `findr [dirs...]`, no flags.
- `test_env.odin` — Test harness with temp dirs and mock filesystems.
- `findr_test.odin` — 10 integration tests.
**Result:** All 32 tests pass (22 gitignore + 10 walker), zero leaks.
---
### Phase 3: Parallel Traversal (future)
**Goal:** Parallelize directory descent for large trees.
---
### Phase 4: Benchmark (future)
**Goal:** Quantify performance vs fd on large directory trees.
---
### Phase 5: Integrate into envr (future)
**Goal:** Replace `run_fd` in `scan.odin`. `scan_path` calls `findr.walk()` directly instead of two subprocess runs + diff.
## Risks
| Risk | Mitigation |
|---|---|
| Single-threaded may be slow on huge trees | Add threading in Phase 3 after correctness |
| Gitignore edge cases (`**/foo`, `foo/**/bar`) | Comprehensive gitignore_test.odin with spec examples |
| dirent.type may be UNKNOWN on some filesystems | Fall back to stat only when type is UNKNOWN |
| Missing nested `.env` files in monorepos | Accepted limitation — flat gitignore model |
| Memory allocation churn from path strings | Use thread-local arena allocators in Phase 3 |