The single highest-leverage practice in agentic iOS coding, as of mid-2026, is the one Anthropic’s Best Practices for Claude Code names directly: “Give Claude a way to verify its work.” On the iOS side, that practice has been hard to apply. I have built a Claude Code skill called ios-build-verify that makes it cheap.
The skill bundles two halves of the iOS agentic-coding loop. The build half pipes xcodebuild through xcbeautify for token-cheap building and unit testing, with raw output mirrored to a build.log file as a diagnostic fallback. The verify half pairs Cameron Cooke’s AXe, a Swift-native simulator-automation CLI, with xcrun simctl and exposes them through named-intent operations: launch the app, tap a control by its accessibility identifier, read or set a field’s value, verify a screen has loaded, screenshot a named view, and audit a view for missing accessibility modifiers. State checks read AXe’s describe-ui accessibility-tree dump rather than screenshots, favoring text before pixels. Screenshots land on disk and are read only when layout, typography, color, or spacing are actually under review.
Installation in Claude Code is two commands:
/plugin marketplace add https://github.com/vermont42/ios-build-verify
/plugin install ios-build-verify@ios-build-verify
The README documents the install paths, the operations the skill exposes, and the starter prompt. I am not going to recapitulate any of that here. The README is the surface; this post is the why underneath it.
This is the second post in a series on Claude Code skills I have built for iOS. The first, Borrowing Taste from the Web, narrated the iOS Design Agent Skill: a port of Anthropic’s frontend-design skill that gives Claude Code a designer’s eye on iOS interfaces. The two skills address orthogonal halves of agentic iOS development. The first asks whether the UI looks right; the second asks whether it works.
The Verification Floor
Anthropic’s guide names self-verification as the agentic-coding leverage point that matters most. Under the heading “Give Claude a way to verify its work,” the guide states: “Include tests, screenshots, or expected outputs so Claude can check itself. This is the single highest-leverage thing you can do.” It elaborates:
Claude performs dramatically better when it can verify its own work, like run tests, compare screenshots, and validate outputs. Without clear success criteria, it might produce something that looks right but actually doesn’t work. You become the only feedback loop, and every mistake requires your attention.
The phrase to lift from this guidance and hold onto is the only feedback loop. When the human is the only feedback loop, every mistake the agent makes commands the human’s attention. A loop in which the human verifies typos and missing semicolons is the lowest-quality form of human-in-the-loop: the human’s attention is finite, and it is being spent on work the agent could have done.
The guide names the failure pattern this produces as “the trust-then-verify gap”: “Claude produces a plausible-looking implementation that doesn’t handle edge cases.” The prescription is the title of this post in mirror image. Trust the agent’s claim that the implementation is done; then verify the implementation actually works. “Always provide verification (tests, scripts, screenshots),” the guide concludes. “If you can’t verify it, don’t ship it.”
I want to draw a precise distinction here, because the agentic-coding discourse tends to collapse three different things into one. Self-verification is not self-direction; self-direction is not self-deployment. Self-verification is the agent’s ability to check whether its own output meets criteria the human set. Self-direction is the agent’s ability to decide what the criteria are without human input. Self-deployment is the agent’s ability to ship to production without a human gate. The distinction matters because the marketing literature sometimes elides it. MindStudio’s framing of the “dark factory”, for example, conflates all three under a single banner of unattended autonomy.
ios-build-verify is purely the first. It enables the agent to check whether its code change produced the behavior the human asked for, and it surfaces failures with diagnostics specific enough that the agent can act on them without escalating. It does not decide what the human asked for; it does not ship the result. Self-verification is the floor that makes higher-quality human-in-the-loop possible. Self-verification is not the abolition of human-in-the-loop.
The cognitive-load shift this enables is the strongest non-dark-factory argument for the skill. A verification-capable agent moves the human from “is this code correct,” which the agent can answer, to “is this approach right,” which only the human can answer. The human stays in the loop. The human stays in the loop at the level of judgment, not at the level of typo-catching and semicolon-spotting that wastes engineering attention. That is the trade I want from agentic coding, and it is the trade ios-build-verify is built to enable for iOS engineering.
How I Used to Verify
The frustration that produced this skill had a specific source. In the spring of 2026 I shipped Konjugieren, a free iOS app for learning German verb conjugation, built over twelve weeks with Claude Code as my AI co-developer. The app has 14,900 lines of Swift and 416,000 words of bilingual prose; it includes three on-device AI features, six widgets, and a quiz with Game Center leaderboards. It is the most ambitious app I have shipped.
The verification half of development was tedious. The agent would produce a feature; I would tell Xcode to build and run; I would launch the simulator; I would tap through the new flow; if something looked wrong, I would screenshot the simulator and paste the image into the conversation; if something behaved wrong, I would describe the failure in prose. Every iteration cycle bottlenecked on me. The agent moved at the speed of language; my eyes and my keyboard moved at the speed of my eyes and my keyboard. The asymmetry compounded. By the late stages of Konjugieren’s development, the human was, demonstrably, the slowest part of the loop.
Having shipped Konjugieren, I decided to address this friction. AztecCal, an Aztec-calendar conversion app I have been developing since late April 2026, exists for this purpose and this purpose only. It is, in my own private taxonomy of side projects, the first one I have built whose purpose was not to ship an app but rather to build the tools that would make shipping the next one faster and easier.1
The Build Half
The agentic-coding case for piping xcodebuild through xcbeautify is, at its root, an argument about token economy. Every token spent on build-output noise is a token unavailable to the agent’s reasoning, and the context window is fixed per session. When the window runs out, the harness compacts prior context, and compaction loses information. Tokens spent on plumbing are not paid once; they are paid forward every time compaction triggers earlier than necessary, with each compaction degrading the agent’s grip on the actual problem.
I measured the compression ratio on two real apps: AztecCal, the laboratory project (14 Swift files, no SwiftPM dependencies); and Konjugieren, materially larger (a main app, a Widget extension, a Shared dual-target, and a TelemetryDeck SwiftPM dependency). AztecCal’s clean build went from 406 raw lines to 61 beautified lines: 6.7×. Konjugieren went from 1,694 to 311: 5.45×.
That is one axis of the case for xcbeautify. The other is human-readability, which improves dramatically even at modest compression ratios. Sixty-one beautified lines are scannable. Four hundred and six raw lines are not. When I read an agent’s transcript in terminal scrollback or a CI log artifact, I want the same signal-to-noise ratio the agent gets. Auditor ergonomics are a first-class win, not a byproduct.
xcbeautify is, unfortunately, a lossy filter. Some of what it drops, the agent occasionally needs. Multi-line Swift fix-it hints get compressed; AppIntents-metadata warnings disappear silently; swift-frontend linker chains get summarized to the point where a “referenced from” lookup goes quiet. The first clean build I ran with the skill in place caught both the surfaced and the dropped cases at once. AztecCal’s Converter.swift emitted a real Swift 6 actor-isolation warning, displayed cleanly with xcbeautify’s warning marker and a source-caret excerpt; the same build emitted an “appintentsmetadataprocessor: warning: Metadata extraction skipped. No AppIntents.framework dependency found.” line that xcbeautify swallowed without a trace. The dropped warning was benign in context (AztecCal does not use AppIntents, so “skipped” is correct), but its category was exactly the one I had predicted would get lost.
The skill’s response to xcbeautify’s lossiness is a tee build.log mirror: every build’s raw output goes to a file, the agent reads xcbeautify’s condensed summary on the happy path, and the agent falls back to Read-ing build.log directly when the summary does not answer “what do I change.” The raw log earned its keep on day one of organic use, the strongest available vindication of the dual-output mirror.
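For concreteness, here is the shape of that dual-output pipeline, as a minimal sketch rather than the shipped script; the scheme name and simulator destination are placeholders, and the skill’s real operation handles more flags and paths than this.

```bash
#!/bin/bash
# Minimal sketch of the build half's dual-output shape. Scheme and
# destination are placeholders, not skill defaults.
set -o pipefail  # otherwise the pipeline reports xcbeautify's exit status,
                 # and a failed xcodebuild would look like success

xcodebuild build \
  -scheme AztecCal \
  -destination 'platform=iOS Simulator,name=iPhone 16' \
  2>&1 | tee build.log | xcbeautify
# Happy path: the agent reads the condensed xcbeautify output above.
# Fallback: when the summary does not answer "what do I change," the agent
# reads build.log, which holds the raw, unfiltered xcodebuild output.
```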
The diagnostic shape xcbeautify produces is Path/To/File.swift:42:15: error: .... This format is the same shape grep -n produces, the same shape every IDE understands, and the same shape Claude Code uses internally for source references in conversation. It is, in the agentic-coding sense, the format the agent already knows how to act on: Edit Path/To/File.swift is the immediate next move, with no intermediate transformation required. The choice of format is not decorative. Picking any other shape would force the agent to re-parse before acting; picking this one means the next step is unambiguous.
The Verify Half
The verify half rests on three primitives, all drawn from AXe and simctl. Lifecycle operations boot the simulator, install the freshly built app, launch by bundle identifier, and terminate between runs to reset in-memory state. Drive operations dispatch input events to the simulator: tap by accessibility identifier as the default selector; tap by accessibility label as the secondary; tap by coordinate for elements the accessibility tree fails to expose; plus type, swipe, and key-combo for the rest of the input surface. Observe operations read structured state back: axe describe-ui emits a JSON dump of the accessibility tree, and axe screenshot writes a PNG to a file.
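A minimal sketch of the three families in sequence follows. The UDID, app path, bundle identifier, and coordinates are placeholders, and any flag spelling beyond those quoted in this post should be checked against AXe’s own help output.

```bash
# Placeholders throughout: app path, bundle identifier, coordinates, UDID.
UDID="<booted-simulator-udid>"   # e.g. from `xcrun simctl list devices booted`

# Lifecycle (simctl): install the fresh build, launch, terminate to reset state
xcrun simctl install booted build/Debug-iphonesimulator/AztecCal.app
xcrun simctl launch booted com.example.AztecCal

# Drive (AXe): dispatch a tap at the HID layer; the skill's tap-by-identifier
# operation resolves an identifier to coordinates via describe-ui first
axe tap -x 196 -y 420 --udid "$UDID"

# Observe (AXe): read structured state back as JSON rather than pixels
axe describe-ui --udid "$UDID" > tree.json

xcrun simctl terminate booted com.example.AztecCal
```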
The crucial distinction the skill rests on is between driving the input and observing the outcome. I credit AXe’s official skill for this framing. The AXe CLI dispatches input events at the Human Interface Device (HID) layer. When axe tap exits 0, the agent knows the tap event reached the simulator; the agent does not yet know that the app processed the event. A tap might land on a region with no gesture recognizer, arrive while a transition is in flight, or hit a control that has been disabled since the last describe-ui. The skill demonstrates this honestly. axe tap -x 5000 -y 5000 exits 0 on a 402-point-wide simulator screen, and nothing happens. Exit codes carry dispatch-success semantics, not behavioral semantics, and the verification work is a separate, explicit step.
Here is where named-intent operations earn their place in the design. A bare axe tap followed by a bare axe describe-ui followed by a bare grep for the expected post-condition is a sequence that the agent has to compose every time. A named-intent operation composes the sequence once, exposes it as a single verb whose name describes the intent, and lets the agent reason at the intent level. The cleanest demonstration in the skill is verify_value.sh. The agent calls verify_value.sh input_convert_month "7". On match, the script echoes 7 and exits 0; on mismatch, it prints error: expected '7', got '4' and exits 6. One call, observation and assertion together. The agent gets a parseable diagnostic in one line.
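The shape of that operation, sketched under assumptions: the shipped script’s internals differ, and the AXIdentifier/AXValue key names below are illustrative stand-ins for whatever AXe’s JSON actually emits.

```bash
#!/bin/bash
# verify_value-shaped sketch: observation and assertion behind one verb.
# Usage: verify_value.sh <accessibility-identifier> <expected-value>
# Assumes a unique identifier match; the shipped script exits 5 on duplicates.
ID="$1"; EXPECTED="$2"

ACTUAL=$(axe describe-ui --udid "$UDID" \
  | jq -r --arg id "$ID" '.. | objects | select(.AXIdentifier? == $id) | .AXValue')

if [[ "$ACTUAL" == "$EXPECTED" ]]; then
  echo "$ACTUAL"   # on match: echo the value and exit 0
else
  echo "error: expected '$EXPECTED', got '$ACTUAL'" >&2
  exit 6           # on mismatch: a one-line parseable diagnostic
fi
```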
There is a small architectural beauty to how the named-intent layer composes its primitives without papering over their honesty. axe type is HID-faithful: it does not replace existing text in a focused field; it appends.2 So set_value.sh, which does what its name says (set this field to X), cannot be a one-liner over axe type. The original plan called for a per-key-backspace clearing loop, sized by reading the field’s current value first. That worked, but it leaked the underlying mechanism. The version that shipped does something better. It composes axe key-combo --modifiers 227 --key 4 (Command-A, select all) with axe type "$TEXT". Two HID dispatches, constant-time in field length, no need to know the field’s current contents first. The primitive layer stays honest about appending; the named-intent layer hides the consequence by reaching for the right second primitive.
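In sketch form, with the two AXe invocations quoted above verbatim and the surrounding plumbing illustrative:

```bash
# set_value's core composition. HID usage 227 is Left Command and usage 4 is
# the letter A, so the first dispatch is select-all; typing then replaces the
# selection. Surrounding plumbing (UDID handling, focus) is elided.
axe key-combo --modifiers 227 --key 4 --udid "$UDID"   # Command-A: select all
axe type "$TEXT" --udid "$UDID"                        # type over the selection
```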
Now the cost asymmetry. State checks via describe-ui cost a few hundred tokens per call. Screenshots cost between 1,600 and 6,300 image tokens, depending on resolution and content. The 10×–30× difference compounds across a verification flow. Across one Konjugieren-shaped flow, which might run thirty state checks in the course of verifying a feature, the savings are the difference between finishing the feature in a session and navigating the Scylla and Charybdis of context compaction and reset. The text-before-pixels rule is a token-economy argument first.
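The back-of-envelope arithmetic, with midpoint costs assumed:

```bash
# Illustrative arithmetic only; per-call costs vary with tree size and
# screenshot resolution.
#   30 state checks  × ~300 tokens/describe-ui  ≈  9,000 tokens
#   30 screenshots   × ~3,000 tokens/image      ≈ 90,000 tokens
# The text path leaves roughly 80,000 more tokens of context for reasoning.
```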
This rule also promotes reliability. Pixels are noisy. Anti-aliasing varies by GPU; transient cursor blink is not deterministic; a screenshot taken mid-animation captures a different progress point on every run. Comparing screenshot bytes to determine “is this state X” is a fragile equality. For example, Justin Searls, co-founder of Test Double, has observed, of the related practice of snapshot testing, that “because they’re more integrated and try to serialize an incomplete system… they will tend to have high false-negatives.” (Quoted in Kent C. Dodds, Effective Snapshot Testing.) Comparing AXValue strings to determine the same thing is an exact string comparison. The text path is strictly more reliable; the cost asymmetry is gravy on top.3
The Principles That Emerged
The skill is the artifact, but the principles that crystallized out of its development are, I submit, the more portable contribution. They generalize past the iOS context, past Claude Code, and past the specific shape of an accessibility tree. Here are four of them, in the order in which they earn their keep.
1. Lenient at the schema layer, strict at the assertion layer. The skill’s verification surface is the SwiftUI accessibility tree, which means its quality depends on whether the target app carries the relevant .accessibility* modifiers. The architectural fork was: require modifiers at install time (strict), or work against whatever is present and grow coverage with use (lenient). Lenient won. A strict skill closes the adoption door for every existing iOS codebase that did not anticipate verification annotations; a lenient skill works on day one, and verification quality scales with annotation coverage as the user works. But within a lenient adoption envelope, the skill’s assertion operations are unyieldingly strict. read_value.sh exits 5 on duplicate identifiers rather than picking the first match. The reason mirrors the lenient case: silent ambiguity in the assertion layer would erode the trust that lenient adoption was buying. Lenient at the schema layer is for adoption; strict at the assertion layer is for trust.
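The strict half of that principle is a few lines of script. A sketch, with the same illustrative key-name caveat as the verify_value sketch above:

```bash
# Strictness at the assertion layer: refuse to guess when an identifier is
# ambiguous, rather than silently picking the first match.
MATCHES=$(axe describe-ui --udid "$UDID" \
  | jq --arg id "$ID" '[.. | objects | select(.AXIdentifier? == $id)] | length')

if (( MATCHES > 1 )); then
  echo "error: identifier '$ID' matches $MATCHES elements; refusing to pick one" >&2
  exit 5
fi
```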
2. Loud failure at the boundary where the cause is visible. The principle springs from a specific incident. A validator agent, running the skill against a calculator-shaped app for the first time, typed the string dozen-fives into a TextField. iOS’s default .textInputAutocapitalization(.sentences) setting transformed the string into Dozen-fives between the HID type event and the AXValue read-back. The early version of set_value.sh had no post-condition check; it exited 0 with set: input_calc_label = 'dozen-fives' (a green log), and the bug surfaced two layers downstream when verify_value later failed against the saved row with a diagnostic that pointed at the wrong place. The fix was to make set_value.sh re-read the AXValue after typing and exit 6 with a three-cause diagnostic if the bound state does not match the input. Autocapitalization was the second cause in the enumeration; the validator’s fix was a one-line .textInputAutocapitalization(.never) on the affected field, made on the first try. The principle generalizes past set_value. The cost of catching a bug two layers downstream is not paid once; it is paid by every future debugger of the same-shape bug. Loud failure at the boundary where the cause is identifiable, with a named cause and a corrective action, is the pattern.
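A sketch of the post-condition check that shipped; the wording of the real enumeration differs, and read_value.sh here stands in for the skill’s own AXValue read-back operation.

```bash
# Loud failure at the boundary: re-read the bound AXValue after typing and
# name the plausible causes in the diagnostic itself.
BOUND=$(read_value.sh "$ID")
if [[ "$BOUND" != "$TEXT" ]]; then
  cat >&2 <<EOF
error: typed '$TEXT' but field '$ID' holds '$BOUND'
likely causes:
  1. the field was not focused when the type event dispatched
  2. an input transform (e.g. .textInputAutocapitalization) rewrote the text
  3. the binding rejected or reformatted the value
EOF
  exit 6
fi
echo "set: $ID = '$BOUND'"
```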
3. Mechanize prose recipes. A SKILL.md sentence saying “prefer leaf elements when adding launch-screen anchors” is weaker than an error message that says “looks like rollup; here is what is actually present in the tree.” A pre-flight calibration recipe that requires the reader to “open the screenshot at 100% and measure” is weaker than a script that does the centroid detection automatically. New skills are mostly prose; mature skills are mostly scripts. The path from one to the other is repeated validation passes that identify a prose recipe in need of mechanization. The clearest example in the skill is the _classify_present_ids.sh helper, extracted from a recurring pattern in read_value.sh’s exit-4 diagnostic. The same hint surface had been classifying three failure modes (identifier rollup, modal-popover gating, app crash) in three different ways across as many sessions; pulling the pattern into a sourced helper made the classification deterministic and reusable, and the principle had a name.
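What mechanization looks like in miniature; the shipped helper’s classification criteria are richer than this, and the branch conditions below are illustrative only:

```bash
# From prose recipe to deterministic classification: given a describe-ui dump,
# name the failure mode instead of asking the reader to reason it out.
classify_absence() {
  local tree="$1"   # path to a describe-ui JSON dump
  if [[ ! -s "$tree" ]] || [[ "$(jq 'length' "$tree")" == "0" ]]; then
    echo "hint: tree is empty; the app likely crashed or never launched"
  elif jq -e '.. | objects | select(.children? == [])' "$tree" > /dev/null; then
    echo "hint: looks like rollup; here is what is actually present in the tree:"
    jq -r '.. | objects | .AXIdentifier? // empty' "$tree" | sort -u
  else
    echo "hint: possible modal or popover gating; only the top layer is visible"
  fi
}
```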
4. Migration by use beats whole-project audit. For existing iOS projects whose codebases predate verification-focused accessibility annotations, the skill’s verify operations include an annotation-check phase: when the agent verifies a screen, it ensures the relevant elements carry the modifiers the verification needs, proposing additions inline as part of the same change. The user does not run a separate “audit the whole project” task. Coverage grows where the user is actively working. Three properties make this the right shape rather than the wrong one. First, migration cost amortizes across routine feature work. Second, coverage matches use; the most-verified parts of the app become the most-annotated parts, exactly the right shape since the long tail of unverified screens did not need annotations anyway. Third, every annotation added is justified at the moment of writing by the verification flow that needed it. Tools that demand prerequisite work before being valuable lose against tools that produce value on day one and grow into their full surface as users adopt them.
Build, Don’t Adopt
My goal was to close the agentic loop from day one, but my plan changed. I started this project intending to adopt someone else’s skill. Conor Luddy’s ios-simulator-skill was the natural starting point. Conor has done considerable thinking about agent-driven simulator interaction and has written two posts on the subject that are worth reading independently of his skill: Bringing Accessibility into the AI Coding Workflow and Building a Swift Accessibility Skill. I worked with his skill, examined its design, and decided to build my own. The clean version of why: Conor’s skill is Python-based, and I prefer not to maintain Python tooling for a daily-driver workflow.4 I also looked at XcodeBuildMCP, the TypeScript and MCP-server alternative, and decided against it for reasons I will describe shortly.
The deeper reason is one I did not understand until Margaret Storey, in February 2026, gave me a name for it. In her post Cognitive Debt, Storey draws on Peter Naur’s theory of the program, the collective developer understanding of what the program does and how it can be changed, and observes that AI velocity threatens the theory: the code can stay readable while the human’s grasp of why it was written that way evaporates. Cognitive debt, in her framing, is the debt compounded from going fast, and it lives in the developers’ minds rather than in the code. The distinction from technical debt is the move that makes the framework do real work. Technical debt is a property of the artifact; cognitive debt is a property of the people who maintain the artifact, and the only currency that pays it down is the slow work of building or rebuilding the theory.
Adopting a skill or MCP wholesale, even if its design is a perfect fit for one’s needs, opens a Storey-shaped gap between running code and theory-held-in-mind on day one. Studying the adopted skill or MCP can pay the debt down, but the cost of holding someone else’s theory often exceeds the cost of building one’s own. Someone else’s design carries assumptions one does not share; someone else’s abstractions optimize for cases one does not have; the consequential decisions are buried under cosmetic ones. Building oneself, informed by having surveyed the alternatives, lands one at zero debt with the survey work already done. The survey is not waste. It is what distinguishes informed building from blind reinvention, and it is what distinguishes this principle from Not-Invented-Here syndrome, which causes one to build from ignorance; I advocate building from informed choice.
The principle is selective. I delineate the boundary clearly because the cognitive-debt argument can be misread as anti-dependency in general, and that would be wrong. AztecCal depends on AXe, xcbeautify, Swift, Xcode, the iOS SDK, and simctl with no implementation theory held in my mind, and that is the correct approach. The line I draw is roughly this: for artifacts in the daily-driver modification path, the cognitive-debt math favors build over adopt; for stable libraries one will only call, adopt is fine. The verification skill sits on the build side because it will evolve with every iOS update and every new feature; AXe sits on the adopt side because I will not be patching its Swift internals. What I need from AXe is interface theory (which operations exist and how they compose), not implementation theory.
Constructive application of this principle requires cabining it: naming where it does not apply, even among this skill’s own dependencies. AXe (Cameron Cooke) and xcbeautify (Charles Pisciotta) are both third-party tools that ios-build-verify depends on at its boundaries. If either maintainer becomes unresponsive, or if either tool falls behind iOS releases, the affected half of the skill breaks until someone forks or reimplements it. The risk is acceptable here for three concrete reasons: the current implementations work well for the skill’s needs as of iOS 26.3; reimplementing either from scratch is not a realistic time investment for a solo developer; and both projects are actively maintained, as evidenced by their GitHub activity.
The reader-facing recommendation that follows is that you should try my skill, then build your own. The cognitive-debt math is one a reader can run only after working with a skill in the daily-driver path long enough to know whether it fits. My recommendation is to install ios-build-verify, exercise it on a SwiftUI app for a week, and then decide. If the math leans toward keeping it, keep it. If the math leans toward replacing it with a skill shaped to your own loop, replace it. The skill’s source is short enough that reading it is realistic; the operations are scripts whose behavior is inspectable; the four principles in the previous section are the parts I think will travel even if the implementation does not.
The Hardening Process
Hardening ios-build-verify was the part of the development arc I least anticipated and the part that most changed the artifact. The skill itself took about a week to write. The hardening cycle that followed took about three days, and the artifact at the end of those three days was meaningfully different from the artifact at the start.
The shape of the cycle, named retrospectively, is two-sessions-per-pass. A validator session runs the skill against a fresh project (Calculator, Calculator2, Calculator3, GenericApp, GenericApp2, Konjugieren) under explicit “report friction honestly” framing; the validator agent is a fresh Claude Code session with no carry-over context from prior passes. The validator writes a continuous friction log during prompt execution. A synthesizer session, run by me in the AztecCal laboratory, reads the validator’s notes, weighs and reframes the findings, and ships changes to the skill. I want to claim something quietly significant about this cycle, namely that the asymmetry between the validator’s fresh eyes and my accumulated context is the engine that powers it. Friction I have absorbed silently re-emerges for fresh validators, and the workflow forces it back into view.
One number captures the hardening better than any prose. In the Calculator2 session (May 1, 2026), a validator tried to flip a SwiftUI Toggle inside a Form inside a NavigationStack and never converged. Tap-by-label dispatched to AXFrame coordinates that did not match screen coordinates; tap-by-coordinate at visually-measured positions did not trigger the gesture; set_value reported exit 0 every time despite read_value showing the AXValue stayed unchanged. The session ended with the Toggle un-flippable. After the May 2 hardening (loud failure in set_value, named-cause diagnostics, a cross-referenced workaround section in SKILL.md), the Calculator3 session walked the same scenario in seven script invocations. set_value.sh exited 6 with a diagnostic and a cross-reference; the validator followed the cross-reference to SKILL.md’s “iOS 26 Form-in-NavigationStack” section; the Toggle flipped on the first try. From unbounded to seven invocations, in one hardening pass. Numbers like that beat prose claims like “diagnostics improved,” because they are falsifiable.
Another finding from this arc changes how I think about validation as a design tool. On May 3, the GenericApp validator was trying to verify the selection state of a segmented Picker, hit the iOS 26 accessibility-tree-empty-children bug class,5 looked for a verify path, and discovered axe describe-ui --point <x>,<y>: a per-point inspection primitive my own design document had asserted did not exist in AXe 1.6.0. (My document had verified the absence of the named command axe describe-point, which is genuinely absent. The inference that per-point inspection itself was absent was wrong. The capability lives under a flag on the existing describe-ui command.) The skill went six sessions without documenting --point. Validators discover not just bugs in capabilities the developer already knew about, but capabilities the developer did not know to look for. That is a different relationship to one’s own design than testing, and it is the relationship that makes parallel validation worth running.
The mechanics of the validator-synthesizer cycle deserve their own post, and I am going to tease one rather than try to fold the workflow into this one. The pattern that emerged covers domain non-overlap as a strategy, the report-as-contract artifact, the synthesizer’s reframes (not its rubber-stamps) as the place value lives, and the round-trip-count metric that tells one whether the next pass is shipping changes that matter. For now I will say only that the pattern is portable past iOS, past skill-development, and past Claude Code; it generalizes to any report → triage → fix workflow in which the validator and the synthesizer can usefully be two different actors.
Drawbacks
The skill has four drawbacks. For the sake of both courtesy and persuasion, I will address them.6
First, iOS 26 has a class of accessibility-tree bugs that travel across every simulator-automation tool, including AXe, idb, and any XcodeBuildMCP automation path that traverses describe-ui. The bug class lives at the FBSimulatorControl layer, beneath all of these tools, so switching tools does not rescue the workflow. The known instances as of iOS 26.3: TabView children-not-enumerated (the Tab Bar AXGroup is enumerated with empty children: []); AXFrame-vs-rendered-geometry divergence (the iOS 26 floating tab pill reports a frame much wider than its visible width); Slider AXValue typeMismatch (AXSlider elements emit a numeric AXValue, which AXe’s JSON decoder cannot round-trip and which then poisons unrelated tap_id lookups in the same describe-ui call); smart-punctuation rewriting on TextField and on TextEditor (smart dashes and smart quotes silently transform typed input on iOS 26); Form-in-NavigationStack autocapitalization (already discussed). The skill works around each of these with coordinate-fallback tables, post-condition checks, or documented workarounds; the bug class is not the skill’s fault, but the workarounds are.
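The coordinate-fallback workaround, in sketch form; the key structure inside data/coordinates.json shown here is illustrative, not the shipped schema:

```bash
# Tap a tab the accessibility tree refuses to enumerate: look up calibrated
# coordinates from the skill's fallback table and dispatch a coordinate tap.
TAB_X=$(jq -r '.tabs.converter.x' data/coordinates.json)
TAB_Y=$(jq -r '.tabs.converter.y' data/coordinates.json)
axe tap -x "$TAB_X" -y "$TAB_Y" --udid "$UDID"   # bypasses the empty-children AXGroup
```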
Second, the skill violates its no-Python aspiration. measure_tab_pill.sh, which detects per-tab centers in a screenshot of the iOS 26 floating tab pill, uses Python and Pillow to do the image work. Pure-bash centroid detection would be a substantial reinvention for marginal benefit, and while Apple removed the system Python in macOS 12.3, the Xcode Command Line Tools supply python3, and any machine that can run this skill has Xcode. The README lists Pillow as an optional dependency, since missing Pillow only blocks tab-pill calibration and not any other verify operation. The cleanest version of the skill’s pitch is “no Python,” and the skill does not honor that pitch. In practice, the aspiration manifests as “Python where it pays for itself, shell elsewhere.”
Third, the cognitive-debt argument applies to AXe and to xcbeautify as much as it applies to Conor’s skill, which I have already addressed. I am taking the maintenance-loop risk on both, and the principle does not absolve the skill of the risk; it constrains the choice of which dependencies sit on the call side and which sit on the modify side.
Fourth, the skill has been validated against Claude Code (CLI) running Claude Opus 4.7. The shell scripts are harness-agnostic, but the skill’s use leans on agent judgment in places that have only been exercised on this configuration: reading SKILL.md after set_value.sh exit 6 and applying the documented Form-in-NavigationStack workaround; running the agent-led colloquy without it derailing into ambiguous “your proposed answers are good” replies; recognizing when to use MAIN_TABS_COORDS versus editing the shared data/coordinates.json. Behavior on untested configurations (Sonnet, Haiku, non-Anthropic models, IDE-embedded agents, MCP-driven setups) may vary from “works fine” to “subtly wrong in ways that look like skill bugs but are actually agent-judgment shortfalls.” I have chosen to surface this scope-of-validation framing in the README and in SKILL.md prominently, rather than imply universal applicability the skill has not earned. Reports from other configurations are welcome.
Closing
The scope of ios-build-verify is precise. It does not ship the iOS app; it does not decide what the app should do; it does not even decide what to verify. It puts the question “did the change I just made produce the behavior I expected” into the agent’s reach, so that the human reviewing the work does not have to be the only feedback loop. That is the floor it puts under higher-quality human-in-the-loop. As a wise man once observed, self-verification is the floor that makes higher-quality human-in-the-loop possible. Self-verification is not the abolition of human-in-the-loop.
The reader-facing recommendation is the one Build, Don’t Adopt argued for. Try ios-build-verify on a SwiftUI app of your own. Then, if the cognitive-debt math leans that way for you, build your own. The skill is small, the operations are scripts, the SKILL.md prose is shorter than this post; the case for keeping the skill or for replacing it is, after a week of organic use, one a reader can make.
Credits
Cameron Cooke for AXe, the verify half’s foundation. Charles Pisciotta for xcbeautify, the build half’s. Conor Luddy for ios-simulator-skill and his two posts, and for closing two of my issues against his skill on the same day I filed them. Anthropic for Best Practices for Claude Code and for the frontend-design skill that is the spine of the iOS Design Agent Skill that preceded this one. Antoine van der Lee for the install-instructions structure I borrowed from his SwiftUI Agent Skill. Lawrence Lomax for the idb framework on whose lower-level libraries AXe builds. Margaret Storey for Cognitive Debt. The validator agents that hardened this skill across eight sessions, and the small army of Claude Code instances that wrote most of the actual scripts.
- AztecCal converts dates from the Gregorian calendar to the Aztec calendar. The conversion is interesting on its own merits. For example, the Aztec calendar is a 260-day ritual cycle interlocked with a 365-day solar year. But the app is, for my purposes, a Petri dish: an iOS app small enough to develop quickly and complex enough to exercise the skill’s surface honestly. ↩
- This is HID-faithful behavior. A real keyboard would not auto-clear a focused field when the user typed; AXe does not pretend otherwise. Faithfulness at the primitive layer is what allows the named-intent layer to compose primitives into operations whose names describe their effects. ↩
- Screenshots remain the right primitive when layout, typography, color, or spacing are under review. The skill captures pixels for visual verification and reads the AXTree for state verification; the two are different surfaces with different failure modes, not redundant ways to verify the same thing. ↩
- Conor’s skill is Python-based and works well for many users. My preference against Python tooling for a daily-driver workflow is a personal one, not a critique of his skill, and the discussion of cognitive debt later in this post is the deeper reason build-vs-adopt was the question I was asking myself. ↩
- The iOS 26 accessibility-tree bug class has many members. Segmented Picker controls enumerate as AXTabGroup with empty children: [], exactly the shape of the Tab Bar’s empty children. SwiftUI controls that are visually segmented but accessibility-treed as single elements with hidden inner structure inherit the same FBSimulatorControl-layer bug; both require coordinate-tap or per-point inspection as the workaround. ↩
- I have written elsewhere about this principle of persuasion. Vermont Rule of Professional Conduct 3.3(a)(2) obligates a lawyer to disclose to the tribunal legal authority adverse to the client, and the practice strengthens the argument rather than weakens it. See Two Applications of Life Experiences. ↩