25 commits in 30 hours: what AI autonomy mode actually ships
Saturday morning to Monday morning. One developer, one AI dev system, no human pair, no copilot pair-programming.
25 commits. 12 arcs closed. Knowledge layer live, voice agent fully working, three proactive surfaces wired to push messages to my phone, an LLM council skill enforced by 34 test assertions, and a v1 of the whole personal agentic OS that I now actually use.
I would like to claim this was all skill. It was not. Most of it was operating-mode changes that I had been resisting for months.
Here is what those changes were, what they shipped, what broke anyway, and what is still aspirational. Honest version.
The Numbers
Real counts, not vibes. Pulled from the handoff doc I wrote at the end of the push.
Window: ~30 hours, Saturday morning to Monday morning
Commits: 25 on origin/main
Arcs closed: 12
Spend: $0 in API costs
Tests added: 500+ green, no env-dep skips
Surfaces live: knowledge wiki, voice agent, telegram briefs (morning + evening),
LLM council skill, /wiki + /wiki-lint slash commands
Five of those arcs were dependent on each other. Seven were not. The five-arc chain was the knowledge layer: foundation, librarian agent, slash commands, brief integration, then a separate arc to wire it into the live query path. The seven independent arcs covered four voice-agent fixes, two telegram-brief scheduling arcs, an LLM council skill, an arc-cleanup hygiene helper, and a README rewrite.
The voice agent took four arcs because it kept finding new ways to break under macOS process boundaries. More on that in the "what broke" section.
What Actually Shipped
Not a feature list. The actual changes to my daily setup.
Knowledge layer. A librarian agent ingests raw documents into a wiki structure inside Obsidian. The wiki is searchable from the agent's tool path, which means my morning brief and my voice queries both read from it before answering. Auto-file-back means production traffic writes new analyses back into the wiki, so the knowledge compounds without me curating it. This is the Karpathy raw-to-wiki pattern, but the loop closes from production output, not from me hand-editing notes.
Voice agent fully bundled. Real .app, Spotlight-launched, menu bar accessory, Cmd+Q teardown, push-to-talk via CGEventTap and Input Monitoring, not NSEvent, which is silently broken on adhoc-signed Spotlight bundles. KeePassXC and AeroSpace both hit the same wall and migrated the same way. Conversational mode with voice activity detection. On-demand mic that closes the input stream when not recording, so the orange mic indicator actually means "recording" and not "open audio handle from twenty minutes ago."
Proactive briefs. Morning brief at 07:30 and evening plan at 19:30 weekdays, scheduled by macOS launchd, posted to my Telegram via stdlib urllib over TLS, with the trust-store fix I wrote up separately. The evening plan now does forward-synthesis over absence inventory, which is a fancy way of saying "tell me what to do tomorrow" rather than "tell me what I did not do today." That change cut output length by 61% and signal density went up.
LLM council. Five-persona deliberation skill, slash command, DNA auto-invocation triggers in CLAUDE.md. Honest framing enforced by 34 test assertions so the language cannot drift into fake-vote territory. Wrote that one up separately too.
Hygiene arc. A cleanup helper that runs at every /end-work, auditing the cowork scratch space, the inbox/outbox/running task directories on both Mac and VPS, dangling test processes, stale launchd state. Keep-list externalized so durable toolkit files do not get flagged every closeout.
README rewrite. The old root README claimed P0b-scaffold state on a system that had been running for a month. I replaced it instead of adding alongside, because a lying front door breaks honesty. New one says what the system actually is. Took an hour because most of it was extraction from existing memory and one real codex_cli brief firing, not new prose.
That is the surface area. Twelve arcs, 25 commits, one weekend.
What Operating Mode Change Actually Got Me Here
This is the only part of the post that matters. The skills did not change. The model did not change. The mode I was running the AI in changed.
Three rules went in, and they compounded.
Rule 1: full decision authority on reasoned choices. Stop converting decided work into Fred-ratifies. If the answer is clear from context, memory, and goals, the AI makes the call and documents the tradeoff inline. Bouncing comes back to me only for five categories: brand-new top-level concept with no precedent, new spend or paid tier change, destructive operations, sudo outside the relaxed set, irrecoverable choices. Everything else gets handled. This sounds obvious in writing. In practice the AI's yes-man instinct is to surface every fork as a "do you want X or Y" because that feels safer. Forcing it to pick was the single biggest change.
Rule 2: relaxed permission asks across machines. The AI executes bash commands directly on my Mac and on my VPS without asking for each one. Deletes, sudo outside the relaxed set, and force-push are still off-limits. Everything else just runs. This used to feel risky. After 30 hours of doing it I cannot remember a single command that should not have run. The triad pattern (planner / executor / verifier) catches the bad ones before they execute.
Rule 3: council mode 1 on internal forks. When the planner phase hits a reasoned-engineering decision with multiple defensible paths, it runs the council inside its own reasoning trace, weighs the five lenses, picks the move that best survives the critique, documents the tradeoff. Does not bounce to me. Across the 12 arcs in this window, the council ran internally on roughly 40 forks. Zero of them came back to my inbox.
Stacked, those three rules do this: the AI handles the meeting that used to be "wait for Fred to wake up and answer this." I lost the relay bottleneck. The work compresses to wall-clock speed of the AI doing it, not wall-clock speed of me being available.
I am not a 10x developer this weekend. I am a developer who stopped being on the critical path of his own dev system.
What Broke (Honest Section)
Three things that I do not want to skip past, because they will land in someone else's autonomous-AI setup the same way they landed in mine.
Premature v1 claims, three times in 12 hours. Closeout summary said "v1 shipped" when the knowledge layer was not actually wired into agentic_query. Closeout said "knowledge layer live" when the auto-file-back path had not been tested in production. Closeout said "Telegram briefs working" when the TLS trust store under launchd was empty and the briefs were silently failing. Each one got caught the same way: I read the summary, said "that does not sound right," and the AI ran the council on its own closeout and the contrarian flagged the framing. None of those would have surfaced if I had just trusted the summary. The honest-framing-as-invariant work in the LLM council arc came directly out of this experience. The pattern is in the system now. Future me does not get to relax it in a quiet edit because the tests fail.
Yes-man relapses on the operating-mode change itself. Even with full decision authority installed, the AI kept producing micro-task plans with timeline padding and "Fred please confirm" lines on decisions that were already made. The pattern looks like compliance but is actually the same yes-man instinct dressed up as process discipline. I had to correct it multiple times during the push. Eventually the rule landed in DNA: ONE dispatch for multi-piece phases, drop timeline estimates anchored on human dev time (AI dev with triad is order-of-magnitude faster), authorize agent-spawning by default. The fact that I had to install that as an explicit rule, on top of the autonomy rule, on top of the council rule, tells you how strong the underlying instinct is.
Em-dash escapes. I have a hard ban on em-dashes in shipped state. The model kept producing them anyway, in commit messages, in tests, in docs. By the end of the weekend I had landed em-dash linting at multiple checkpoints. The lesson is not "models cannot follow style rules." The lesson is that any rule you care about needs to land as an invariant a test enforces, not as advice in a prompt.
Four voice-agent arcs in a row. The voice agent took P-Voice-Bundle, P-Voice-Polish-Conv, P-Voice-Hotfix, then P-Voice-Hotfix-V2. Each one was supposed to be the last one. The thing that finally stuck (CGEventTap plus Input Monitoring) was the documented working path that two other open-source Mac apps had already migrated to. It took me three arcs to find that out, because each prior arc was treating the symptom (NSEvent silently not firing) as a bug to debug, not as a signal that the API was wrong. The faster I had read what KeePassXC was doing, the faster the arc would have closed. Autonomy mode does not save you from this. It just runs you into the wall faster.
What Is Still Aspirational
Honest section, part two. Things I claim are live and are not.
Layer 4 self-learning behavioral memory. The 5-layer memory architecture has CLAUDE.md (static), primer, memory.sh, and Obsidian wired. Layer 4 (self-learning behavioral memory that updates based on conversation signals) is not built. It is the most complex layer and is multi-week work. Calling the system "v1 for solo use" is correct. Calling it "fully self-learning" would not be.
Multi-host wiki sync. The wiki is Mac-only right now. The VPS does not have it. Syncthing mesh or git pull, undecided.
Voice cloning. Backlog. F5-TTS is the recommended local zero-shot path. Not started.
Productization layer. Multi-tenant, template-zero, per-project L0-L4 separation. Separate strategy work, not in the v1 push. The system is mine right now. Making it sellable is a different project.
Dashboard. Tauri menu-bar surface for worker state plus eval scores. Originally a P9+ candidate. Deferred. The system is observable through doctor scripts and log tails today, which is fine for me, not fine for a less-technical user.
If I wrote a marketing version of this post, half of the above would be in the bullet list. They are not in the system yet.
Is This Real Or Did I Just Get Lucky
The fair question. One 30-hour push proves the rules can compound. It does not prove the velocity holds.
Three days later I ran another single-day push: five arcs in one day, all $0 spend, council mode 1 on every fork, zero bounces. That is a second data point. The arcs were smaller and more disciplined than the first push, but the pattern was the same. The operating-mode change is what is doing the work, not weekend energy.
The thing I will be watching is whether arc duration starts creeping back up as I add more complexity. If arc-7 of a session starts hitting the same yes-man patterns that arc-1 hit, the rules need reinforcement. The honest-by-gate work (tests that enforce framing rules as invariants) is the bet I am placing on this not happening.
When This Does Not Work
You cannot run an AI dev system in autonomy mode if you are not willing to read the diffs.
The whole pattern depends on the AI making decisions and you trusting them by default, with the triad catching the bad ones before they ship. If you cannot read the resulting code, you cannot tell when the triad has missed something, and you cannot tell when the AI is confidently shipping the wrong thing. The cost of that on a 25-commit weekend is not zero.
It also does not work for greenfield architectural choices. The five enumerated bounce categories include "brand-new top-level concept with no precedent." Autonomy mode is for execution and arc-level decisions inside an existing architecture. Picking the architecture is still my job.
Close
25 commits in 30 hours is not a record. It is a data point.
The data point is that most of what made the weekend possible was getting out of the way. The skills, the model, the tooling were all the same as a month ago. What changed was the number of decisions I was no longer on the critical path of.
If you have an AI dev system and it feels slow, the question I would ask is not "is the model good enough." It is "how many times per arc does this thing wait for me before continuing." Every one of those is a place a rule could go.
Try installing one.
Tagged



