The single highest-leverage practice in agentic iOS coding, as of mid-2026, is the one Anthropic’s Best Practices for Claude Code names directly: “Give Claude a way to verify its work.” On the iOS side, that practice has been hard to apply. I have built a Claude Code skill called ios-build-verify that makes it cheap.
The skill bundles two halves of the iOS agentic-coding loop. The build half pipes xcodebuild through xcbeautify for token-cheap building and unit testing, with raw output mirrored to a build.log file as a diagnostic fallback. The verify half pairs Cameron Cooke’s AXe, a Swift-native simulator-automation CLI, with xcrun simctl and exposes them through named-intent operations: launch the app, tap a control by its accessibility identifier, read or set a field’s value, verify a screen has loaded, screenshot a named view, and audit a view for missing accessibility modifiers. State checks read AXe’s describe-ui accessibility-tree dump rather than screenshots, favoring text before pixels. Screenshots land on disk and are read only when layout, typography, color, or spacing are actually under review.
Installation in Claude Code is two commands:
/plugin marketplace add https://github.com/vermont42/ios-build-verify
/plugin install ios-build-verify@ios-build-verify
The README documents the install paths, the operations the skill exposes, and the starter prompt. I am not going to recapitulate any of that here. The README is the surface; this post is the why underneath it.
This is the second post in a series on Claude Code skills I have built for iOS. The first, Borrowing Taste from the Web, narrated the iOS Design Agent Skill: a port of Anthropic’s frontend-design skill that gives Claude Code a designer’s eye on iOS interfaces. The two skills address orthogonal halves of agentic iOS development. The first asks whether the UI looks right; the second asks whether it works.
The Verification Floor
Anthropic’s guide names self-verification as the agentic-coding leverage point that matters most. Under the heading “Give Claude a way to verify its work,” the guide states: “Include tests, screenshots, or expected outputs so Claude can check itself. This is the single highest-leverage thing you can do.” It elaborates:
Claude performs dramatically better when it can verify its own work, like run tests, compare screenshots, and validate outputs. Without clear success criteria, it might produce something that looks right but actually doesn’t work. You become the only feedback loop, and every mistake requires your attention.
The phrase to lift from this guidance and hold onto is the only feedback loop. When the human is the only feedback loop, every mistake the agent makes commands the human’s attention. A loop in which the human verifies typos and missing semicolons is the lowest-quality form of human-in-the-loop: the human’s attention is finite, and it is being spent on work the agent could have done.
The guide names the failure pattern this produces as “the trust-then-verify gap”: “Claude produces a plausible-looking implementation that doesn’t handle edge cases.” The prescription is the title of this post in mirror image. Trust the agent’s claim that the implementation is done; then verify the implementation actually works. “Always provide verification (tests, scripts, screenshots),” the guide concludes. “If you can’t verify it, don’t ship it.”
I want to draw a precise distinction here, because the agentic-coding discourse tends to collapse three different things into one. Self-verification is not self-direction; self-direction is not self-deployment. Self-verification is the agent’s ability to check whether its own output meets criteria the human set. Self-direction is the agent’s ability to decide what the criteria are without human input. Self-deployment is the agent’s ability to ship to production without a human gate. The distinction matters because the marketing literature sometimes elides it. MindStudio’s framing of the “dark factory”, for example, conflates all three under a single banner of unattended autonomy.
ios-build-verify is purely the first. It enables the agent to check whether its code change produced the behavior the human asked for, and it surfaces failures with diagnostics specific enough that the agent can act on them without escalating. It does not decide what the human asked for; it does not ship the result. Self-verification is the floor that makes higher-quality human-in-the-loop possible. Self-verification is not the abolition of human-in-the-loop.
The cognitive-load shift this enables is the strongest non-dark-factory argument for the skill. A verification-capable agent moves the human from “is this code correct,” which the agent can answer, to “is this approach right,” which only the human can answer. The human stays in the loop. The human stays in the loop at the level of judgment, not at the level of typo-catching and semicolon-spotting that wastes engineering attention. That is the trade I want from agentic coding, and it is the trade ios-build-verify is built to enable for iOS engineering.
How I Used to Verify
The frustration that produced this skill had a specific source. In the spring of 2026 I shipped Konjugieren, a free iOS app for learning German verb conjugation, built over twelve weeks with Claude Code as my AI co-developer. The app has 14,900 lines of Swift and 416,000 words of bilingual prose; it includes three on-device AI features, six widgets, and a quiz with Game Center leaderboards. It is the most ambitious app I have shipped.
The verification half of development was tedious. The agent would produce a feature; I would tell Xcode to build and run; I would launch the simulator; I would tap through the new flow; if something looked wrong, I would screenshot the simulator and paste the image into the conversation; if something behaved wrong, I would describe the failure in prose. Every iteration cycle bottlenecked on me. The agent moved at the speed of language; my eyes and my keyboard moved at the speed of my eyes and my keyboard. The asymmetry compounded. By the late stages of Konjugieren’s development, the human was, demonstrably, the slowest part of the loop.
Having shipped Konjugieren, I decided to address this friction. AztecCal, an Aztec-calendar conversion app I have been developing since late April 2026, exists for this purpose and this purpose only. It is, in my own private taxonomy of side projects, the first one I have built whose purpose was not to ship an app but rather to build the tools that would make shipping the next one faster and easier.1
The Build Half
The agentic-coding case for piping xcodebuild through xcbeautify is, at its root, an argument about token economy. Every token spent on build-output noise is a token unavailable to the agent’s reasoning, and the context window is fixed per session. When the window runs out, the harness compacts prior context, and compaction loses information. Tokens spent on plumbing are not paid once; they are paid forward every time compaction triggers earlier than necessary, with each compaction degrading the agent’s grip on the actual problem.
I measured the compression ratio on two real apps: AztecCal, the laboratory project (14 Swift files, no SwiftPM dependencies); and Konjugieren, materially larger (a main app, a Widget extension, a Shared dual-target, and a TelemetryDeck SwiftPM dependency). AztecCal’s clean build went from 406 raw lines to 61 beautified lines: 6.7×. Konjugieren went from 1,694 to 311: 5.45×.
That is one axis of the case for xcbeautify. The other is human-readability, which improves dramatically even at modest compression ratios. Sixty-one beautified lines are scannable. Four hundred and six raw lines are not. When I read an agent’s transcript in terminal scrollback or a CI log artifact, I want the same signal-to-noise ratio the agent gets. Auditor ergonomics are a first-class win, not a byproduct.
xcbeautify is, unfortunately, a lossy filter. Some of what it drops, the agent occasionally needs. Multi-line Swift fix-it hints get compressed; AppIntents-metadata warnings disappear silently; swift-frontend linker chains get summarized to the point where a “referenced from” lookup goes quiet. The first clean build I ran with the skill in place caught both the surfaced and the dropped cases at once. AztecCal’s Converter.swift emitted a real Swift 6 actor-isolation warning, displayed cleanly with xcbeautify’s warning marker and a source-caret excerpt; the same build emitted an “appintentsmetadataprocessor: warning: Metadata extraction skipped. No AppIntents.framework dependency found.” line that xcbeautify swallowed without a trace. The dropped warning was benign in context (AztecCal does not use AppIntents, so “skipped” is correct), but its category was exactly the one I had predicted would get lost.
The skill’s response to xcbeautify’s lossiness is a tee build.log mirror: every build’s raw output goes to a file, the agent reads xcbeautify’s condensed summary on the happy path, and the agent falls back to Read-ing build.log directly when the summary does not answer “what do I change.” The raw log earned its keep on day one of organic use, the strongest available vindication of the dual-output mirror.
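For concreteness, here is the shape of that dual-output pipeline, as a minimal sketch rather than the shipped script; the scheme name and simulator destination are placeholders, and the skill’s real operation handles more flags and paths than this.

```bash
#!/bin/bash
# Minimal sketch of the build half's dual-output shape. Scheme and
# destination are placeholders, not skill defaults.
set -o pipefail  # otherwise the pipeline reports xcbeautify's exit status,
                 # and a failed xcodebuild would look like success

xcodebuild build \
  -scheme AztecCal \
  -destination 'platform=iOS Simulator,name=iPhone 16' \
  2>&1 | tee build.log | xcbeautify
# Happy path: the agent reads the condensed xcbeautify output above.
# Fallback: when the summary does not answer "what do I change," the agent
# reads build.log, which holds the raw, unfiltered xcodebuild output.
```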
The diagnostic shape xcbeautify produces is Path/To/File.swift:42:15: error: .... This format is the same shape grep -n produces, the same shape every IDE understands, and the same shape Claude Code uses internally for source references in conversation. It is, in the agentic-coding sense, the format the agent already knows how to act on: Edit Path/To/File.swift is the immediate next move, with no intermediate transformation required. The choice of format is not decorative. Picking any other shape would force the agent to re-parse before acting; picking this one means the next step is unambiguous.
The Verify Half
The verify half rests on three primitives, all drawn from AXe and simctl. Lifecycle operations boot the simulator, install the freshly built app, launch by bundle identifier, and terminate between runs to reset in-memory state. Drive operations dispatch input events to the simulator: tap by accessibility identifier as the default selector; tap by accessibility label as the secondary; tap by coordinate for elements the accessibility tree fails to expose; plus type, swipe, and key-combo for the rest of the input surface. Observe operations read structured state back: axe describe-ui emits a JSON dump of the accessibility tree, and axe screenshot writes a PNG to a file.
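A minimal sketch of the three families in sequence follows. The UDID, app path, bundle identifier, and coordinates are placeholders, and any flag spelling beyond those quoted in this post should be checked against AXe’s own help output.

```bash
# Placeholders throughout: app path, bundle identifier, coordinates, UDID.
UDID="<booted-simulator-udid>"   # e.g. from `xcrun simctl list devices booted`

# Lifecycle (simctl): install the fresh build, launch, terminate to reset state
xcrun simctl install booted build/Debug-iphonesimulator/AztecCal.app
xcrun simctl launch booted com.example.AztecCal

# Drive (AXe): dispatch a tap at the HID layer; the skill's tap-by-identifier
# operation resolves an identifier to coordinates via describe-ui first
axe tap -x 196 -y 420 --udid "$UDID"

# Observe (AXe): read structured state back as JSON rather than pixels
axe describe-ui --udid "$UDID" > tree.json

xcrun simctl terminate booted com.example.AztecCal
```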
The crucial distinction the skill rests on is between driving the input and observing the outcome. I credit AXe’s official skill for this framing. The AXe CLI dispatches input events at the Human Interface Device (HID) layer. When axe tap exits 0, the agent knows the tap event reached the simulator; the agent does not yet know that the app processed the event. A tap might land on a region with no gesture recognizer, arrive while a transition is in flight, or hit a control that has been disabled since the last describe-ui. The skill demonstrates this honestly. axe tap -x 5000 -y 5000 exits 0 on a 402-point-wide simulator screen, and nothing happens. Exit codes carry dispatch-success semantics, not behavioral semantics, and the verification work is a separate, explicit step.
Here is where named-intent operations earn their place in the design. A bare axe tap followed by a bare axe describe-ui followed by a bare grep for the expected post-condition is a sequence that the agent has to compose every time. A named-intent operation composes the sequence once, exposes it as a single verb whose name describes the intent, and lets the agent reason at the intent level. The cleanest demonstration in the skill is verify_value.sh. The agent calls verify_value.sh input_convert_month "7". On match, the script echoes 7 and exits 0; on mismatch, it prints error: expected '7', got '4' and exits 6. One call, observation and assertion together. The agent gets a parseable diagnostic in one line.
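The shape of that operation, sketched under assumptions: the shipped script’s internals differ, and the AXIdentifier/AXValue key names below are illustrative stand-ins for whatever AXe’s JSON actually emits.

```bash
#!/bin/bash
# verify_value-shaped sketch: observation and assertion behind one verb.
# Usage: verify_value.sh <accessibility-identifier> <expected-value>
# Assumes a unique identifier match; the shipped script exits 5 on duplicates.
ID="$1"; EXPECTED="$2"

ACTUAL=$(axe describe-ui --udid "$UDID" \
  | jq -r --arg id "$ID" '.. | objects | select(.AXIdentifier? == $id) | .AXValue')

if [[ "$ACTUAL" == "$EXPECTED" ]]; then
  echo "$ACTUAL"   # on match: echo the value and exit 0
else
  echo "error: expected '$EXPECTED', got '$ACTUAL'" >&2
  exit 6           # on mismatch: a one-line parseable diagnostic
fi
```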
There is a small architectural beauty to how the named-intent layer composes its primitives without papering over their honesty. axe type is HID-faithful: it does not replace existing text in a focused field; it appends.2 So set_value.sh, which does what its name says (set this field to X), cannot be a one-liner over axe type. The original plan called for a per-key-backspace clearing loop, sized by reading the field’s current value first. That worked, but it leaked the underlying mechanism. The version that shipped does something better. It composes axe key-combo --modifiers 227 --key 4 (Command-A, select all) with axe type "$TEXT". Two HID dispatches, constant-time in field length, no need to know the field’s current contents first. The primitive layer stays honest about appending; the named-intent layer hides the consequence by reaching for the right second primitive.
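In sketch form, with the two AXe invocations quoted above verbatim and the surrounding plumbing illustrative:

```bash
# set_value's core composition. HID usage 227 is Left Command and usage 4 is
# the letter A, so the first dispatch is select-all; typing then replaces the
# selection. Surrounding plumbing (UDID handling, focus) is elided.
axe key-combo --modifiers 227 --key 4 --udid "$UDID"   # Command-A: select all
axe type "$TEXT" --udid "$UDID"                        # type over the selection
```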
Now the cost asymmetry. State checks via describe-ui cost a few hundred tokens per call. Screenshots cost between 1,600 and 6,300 image tokens, depending on resolution and content. The 10×–30× difference compounds across a verification flow. Across one Konjugieren-shaped flow, which might run thirty state checks in the course of verifying a feature, the savings are the difference between finishing the feature in a session and navigating the Scylla and Charybdis of context compaction and reset. The text-before-pixels rule is a token-economy argument first.
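The back-of-envelope arithmetic, with midpoint costs assumed:

```bash
# Illustrative arithmetic only; per-call costs vary with tree size and
# screenshot resolution.
#   30 state checks  × ~300 tokens/describe-ui  ≈  9,000 tokens
#   30 screenshots   × ~3,000 tokens/image      ≈ 90,000 tokens
# The text path leaves roughly 80,000 more tokens of context for reasoning.
```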
This rule also promotes reliability. Pixels are noisy. Anti-aliasing varies by GPU; transient cursor blink is not deterministic; a screenshot taken mid-animation captures a different progress point on every run. Comparing screenshot bytes to determine “is this state X” is a fragile equality. For example, Justin Searls, co-founder of Test Double, has observed, of the related practice of snapshot testing, that “because they’re more integrated and try to serialize an incomplete system… they will tend to have high false-negatives.” (Quoted in Kent C. Dodds, Effective Snapshot Testing.) Comparing AXValue strings to determine the same thing is an exact string comparison. The text path is strictly more reliable; the cost asymmetry is gravy on top.3
The Principles That Emerged
The skill is the artifact, but the principles that crystallized out of its development are, I submit, the more portable contribution. They generalize past the iOS context, past Claude Code, and past the specific shape of an accessibility tree. Here are four of them, in the order in which they earn their keep.
1. Lenient at the schema layer, strict at the assertion layer. The skill’s verification surface is the SwiftUI accessibility tree, which means its quality depends on whether the target app carries the relevant .accessibility* modifiers. The architectural fork was: require modifiers at install time (strict), or work against whatever is present and grow coverage with use (lenient). Lenient won. A strict skill closes the adoption door for every existing iOS codebase that did not anticipate verification annotations; a lenient skill works on day one, and verification quality scales with annotation coverage as the user works. But within a lenient adoption envelope, the skill’s assertion operations are unyieldingly strict. read_value.sh exits 5 on duplicate identifiers rather than picking the first match. The reason mirrors the lenient case: silent ambiguity in the assertion layer would erode the trust that lenient adoption was buying. Lenient at the schema layer is for adoption; strict at the assertion layer is for trust.
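The strict half of that principle is a few lines of script. A sketch, with the same illustrative key-name caveat as the verify_value sketch above:

```bash
# Strictness at the assertion layer: refuse to guess when an identifier is
# ambiguous, rather than silently picking the first match.
MATCHES=$(axe describe-ui --udid "$UDID" \
  | jq --arg id "$ID" '[.. | objects | select(.AXIdentifier? == $id)] | length')

if (( MATCHES > 1 )); then
  echo "error: identifier '$ID' matches $MATCHES elements; refusing to pick one" >&2
  exit 5
fi
```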
2. Loud failure at the boundary where the cause is visible. The principle springs from a specific incident. A validator agent, running the skill against a calculator-shaped app for the first time, typed the string dozen-fives into a TextField. iOS’s default .textInputAutocapitalization(.sentences) setting transformed the string into Dozen-fives between the HID type event and the AXValue read-back. The early version of set_value.sh had no post-condition check; it exited 0 with set: input_calc_label = 'dozen-fives' (a green log), and the bug surfaced two layers downstream when verify_value later failed against the saved row with a diagnostic that pointed at the wrong place. The fix was to make set_value.sh re-read the AXValue after typing and exit 6 with a three-cause diagnostic if the bound state does not match the input. Autocapitalization was the second cause in the enumeration; the validator’s fix was a one-line .textInputAutocapitalization(.never) on the affected field, made on the first try. The principle generalizes past set_value. The cost of catching a bug two layers downstream is not paid once; it is paid by every future debugger of the same-shape bug. Loud failure at the boundary where the cause is identifiable, with a named cause and a corrective action, is the pattern.
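A sketch of the post-condition check that shipped; the wording of the real enumeration differs, and read_value.sh here stands in for the skill’s own AXValue read-back operation.

```bash
# Loud failure at the boundary: re-read the bound AXValue after typing and
# name the plausible causes in the diagnostic itself.
BOUND=$(read_value.sh "$ID")
if [[ "$BOUND" != "$TEXT" ]]; then
  cat >&2 <<EOF
error: typed '$TEXT' but field '$ID' holds '$BOUND'
likely causes:
  1. the field was not focused when the type event dispatched
  2. an input transform (e.g. .textInputAutocapitalization) rewrote the text
  3. the binding rejected or reformatted the value
EOF
  exit 6
fi
echo "set: $ID = '$BOUND'"
```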
3. Mechanize prose recipes. A SKILL.md sentence saying “prefer leaf elements when adding launch-screen anchors” is weaker than an error message that says “looks like rollup; here is what is actually present in the tree.” A pre-flight calibration recipe that requires the reader to “open the screenshot at 100% and measure” is weaker than a script that does the centroid detection automatically. New skills are mostly prose; mature skills are mostly scripts. The path from one to the other is repeated validation passes that identify a prose recipe in need of mechanization. The clearest example in the skill is the _classify_present_ids.sh helper, extracted from a recurring pattern in read_value.sh’s exit-4 diagnostic. The same hint surface had been classifying three failure modes (identifier rollup, modal-popover gating, app crash) in three different ways across as many sessions; pulling the pattern into a sourced helper made the classification deterministic and reusable, and the principle had a name.
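What mechanization looks like in miniature; the shipped helper’s classification criteria are richer than this, and the branch conditions below are illustrative only:

```bash
# From prose recipe to deterministic classification: given a describe-ui dump,
# name the failure mode instead of asking the reader to reason it out.
classify_absence() {
  local tree="$1"   # path to a describe-ui JSON dump
  if [[ ! -s "$tree" ]] || [[ "$(jq 'length' "$tree")" == "0" ]]; then
    echo "hint: tree is empty; the app likely crashed or never launched"
  elif jq -e '.. | objects | select(.children? == [])' "$tree" > /dev/null; then
    echo "hint: looks like rollup; here is what is actually present in the tree:"
    jq -r '.. | objects | .AXIdentifier? // empty' "$tree" | sort -u
  else
    echo "hint: possible modal or popover gating; only the top layer is visible"
  fi
}
```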
4. Migration by use beats whole-project audit. For existing iOS projects whose codebases predate verification-focused accessibility annotations, the skill’s verify operations include an annotation-check phase: when the agent verifies a screen, it ensures the relevant elements carry the modifiers the verification needs, proposing additions inline as part of the same change. The user does not run a separate “audit the whole project” task. Coverage grows where the user is actively working. Three properties make this the right shape rather than the wrong one. First, migration cost amortizes across routine feature work. Second, coverage matches use; the most-verified parts of the app become the most-annotated parts, exactly the right shape since the long tail of unverified screens did not need annotations anyway. Third, every annotation added is justified at the moment of writing by the verification flow that needed it. Tools that demand prerequisite work before being valuable lose against tools that produce value on day one and grow into their full surface as users adopt them.
Build, Don’t Adopt
My goal was to close the agentic loop from day one, but my plan changed. I started this project intending to adopt someone else’s skill. Conor Luddy’s ios-simulator-skill was the natural starting point. Conor has done considerable thinking about agent-driven simulator interaction and has written two posts on the subject that are worth reading independently of his skill: Bringing Accessibility into the AI Coding Workflow and Building a Swift Accessibility Skill. I worked with his skill, examined its design, and decided to build my own. The clean version of why: Conor’s skill is Python-based, and I prefer not to maintain Python tooling for a daily-driver workflow.4 I also looked at XcodeBuildMCP, the TypeScript and MCP-server alternative, and decided against it for reasons I will describe shortly.
The deeper reason is one I did not understand until Margaret Storey, in February 2026, gave me a name for it. In her post Cognitive Debt, Storey draws on Peter Naur’s theory of the program, the collective developer understanding of what the program does and how it can be changed, and observes that AI velocity threatens the theory: the code can stay readable while the human’s grasp of why it was written that way evaporates. Cognitive debt, in her framing, is the debt compounded from going fast, and it lives in the developers’ minds rather than in the code. The distinction from technical debt is the move that makes the framework do real work. Technical debt is a property of the artifact; cognitive debt is a property of the people who maintain the artifact, and the only currency that pays it down is the slow work of building or rebuilding the theory.
Adopting a skill or MCP wholesale, even if its design is a perfect fit for one’s needs, opens a Storey-shaped gap between running code and theory-held-in-mind on day one. Studying the adopted skill or MCP can pay the debt down, but the cost of holding someone else’s theory often exceeds the cost of building one’s own. Someone else’s design carries assumptions one does not share; someone else’s abstractions optimize for cases one does not have; the consequential decisions are buried under cosmetic ones. Building oneself, informed by having surveyed the alternatives, lands one at zero debt with the survey work already done. The survey is not waste. It is what distinguishes informed building from blind reinvention, and it is what distinguishes this principle from Not-Invented-Here syndrome, which causes one to build from ignorance; I advocate building from informed choice.
The principle is selective. I delineate the boundary clearly because the cognitive-debt argument can be misread as anti-dependency in general, and that would be wrong. AztecCal depends on AXe, xcbeautify, Swift, Xcode, the iOS SDK, and simctl with no implementation theory held in my mind, and that is the correct approach. The line I draw is roughly this: for artifacts in the daily-driver modification path, the cognitive-debt math favors build over adopt; for stable libraries one will only call, adopt is fine. The verification skill sits on the build side because it will evolve with every iOS update and every new feature; AXe sits on the adopt side because I will not be patching its Swift internals. What I need from AXe is interface theory (which operations exist and how they compose), not implementation theory.
Constructive application of this principle requires cabining it: naming where it does not apply, even among this skill’s own dependencies. AXe (Cameron Cooke) and xcbeautify (Charles Pisciotta) are both third-party tools that ios-build-verify depends on at its boundaries. If either maintainer becomes unresponsive, or if either tool falls behind iOS releases, the affected half of the skill breaks until someone forks or reimplements it. The risk is acceptable here for three concrete reasons: the current implementations work well for the skill’s needs as of iOS 26.3; reimplementing either from scratch is not a realistic time investment for a solo developer; and both projects are actively maintained, as evidenced by their GitHub activity.
The reader-facing recommendation that follows is that you should try my skill, then build your own. The cognitive-debt math is one a reader can run only after working with a skill in the daily-driver path long enough to know whether it fits. My recommendation is to install ios-build-verify, exercise it on a SwiftUI app for a week, and then decide. If the math leans toward keeping it, keep it. If the math leans toward replacing it with a skill shaped to your own loop, replace it. The skill’s source is short enough that reading it is realistic; the operations are scripts whose behavior is inspectable; the four principles in the previous section are the parts I think will travel even if the implementation does not.
The Hardening Process
Hardening ios-build-verify was the part of the development arc I least anticipated and the part that most changed the artifact. The skill itself took about a week to write. The hardening cycle that followed took about three days, and the artifact at the end of those three days was meaningfully different from the artifact at the start.
The shape of the cycle, named retrospectively, is two-sessions-per-pass. A validator session runs the skill against a fresh project (Calculator, Calculator2, Calculator3, GenericApp, GenericApp2, Konjugieren) under explicit “report friction honestly” framing; the validator agent is a fresh Claude Code session with no carry-over context from prior passes. The validator writes a continuous friction log during prompt execution. A synthesizer session, run by me in the AztecCal laboratory, reads the validator’s notes, weighs and reframes the findings, and ships changes to the skill. I want to claim something quietly significant about this cycle, namely that the asymmetry between the validator’s fresh eyes and my accumulated context is the engine that powers it. Friction I have absorbed silently re-emerges for fresh validators, and the workflow forces it back into view.
One number captures the hardening better than any prose. In the Calculator2 session (May 1, 2026), a validator tried to flip a SwiftUI Toggle inside a Form inside a NavigationStack and never converged. Tap-by-label dispatched to AXFrame coordinates that did not match screen coordinates; tap-by-coordinate at visually-measured positions did not trigger the gesture; set_value reported exit 0 every time despite read_value showing the AXValue stayed unchanged. The session ended with the Toggle un-flippable. After the May 2 hardening (loud failure in set_value, named-cause diagnostics, a cross-referenced workaround section in SKILL.md), the Calculator3 session walked the same scenario in seven script invocations. set_value.sh exited 6 with a diagnostic and a cross-reference; the validator followed the cross-reference to SKILL.md’s “iOS 26 Form-in-NavigationStack” section; the Toggle flipped on the first try. From unbounded to seven invocations, in one hardening pass. Numbers like that beat prose claims like “diagnostics improved,” because they are falsifiable.
Another finding from this arc changes how I think about validation as a design tool. On May 3, the GenericApp validator was trying to verify the selection state of a segmented Picker, hit the iOS 26 accessibility-tree-empty-children bug class,5 looked for a verify path, and discovered axe describe-ui --point <x>,<y>: a per-point inspection primitive my own design document had asserted did not exist in AXe 1.6.0. (My document had verified the absence of the named command axe describe-point, which is genuinely absent. The inference that per-point inspection itself was absent was wrong. The capability lives under a flag on the existing describe-ui command.) The skill went six sessions without documenting --point. Validators discover not just bugs in capabilities the developer already knew about, but capabilities the developer did not know to look for. That is a different relationship to one’s own design than testing, and it is the relationship that makes parallel validation worth running.
The mechanics of the validator-synthesizer cycle deserve their own post, and I am going to tease one rather than try to fold the workflow into this one. The pattern that emerged covers domain non-overlap as a strategy, the report-as-contract artifact, the synthesizer’s reframes (not its rubber-stamps) as the place value lives, and the round-trip-count metric that tells one whether the next pass is shipping changes that matter. For now I will say only that the pattern is portable past iOS, past skill-development, and past Claude Code; it generalizes to any report → triage → fix workflow in which the validator and the synthesizer can usefully be two different actors.
Drawbacks
The skill has four drawbacks. For the sake of both courtesy and persuasion, I will address them.6
First, iOS 26 has a class of accessibility-tree bugs that travel across every simulator-automation tool, including AXe, idb, and any XcodeBuildMCP automation path that traverses describe-ui. The bug class lives at the FBSimulatorControl layer, beneath all of these tools, so switching tools does not rescue the workflow. The known instances as of iOS 26.3: TabView children-not-enumerated (the Tab Bar AXGroup is enumerated with empty children: []); AXFrame-vs-rendered-geometry divergence (the iOS 26 floating tab pill reports a frame much wider than its visible width); Slider AXValue typeMismatch (AXSlider elements emit a numeric AXValue, which AXe’s JSON decoder cannot round-trip and which then poisons unrelated tap_id lookups in the same describe-ui call); smart-punctuation rewriting on TextField and on TextEditor (smart dashes and smart quotes silently transform typed input on iOS 26); Form-in-NavigationStack autocapitalization (already discussed). The skill works around each of these with coordinate-fallback tables, post-condition checks, or documented workarounds; the bug class is not the skill’s fault, but the workarounds are.
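The coordinate-fallback workaround, in sketch form; the key structure inside data/coordinates.json shown here is illustrative, not the shipped schema:

```bash
# Tap a tab the accessibility tree refuses to enumerate: look up calibrated
# coordinates from the skill's fallback table and dispatch a coordinate tap.
TAB_X=$(jq -r '.tabs.converter.x' data/coordinates.json)
TAB_Y=$(jq -r '.tabs.converter.y' data/coordinates.json)
axe tap -x "$TAB_X" -y "$TAB_Y" --udid "$UDID"   # bypasses the empty-children AXGroup
```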
Second, the skill violates its no-Python aspiration. measure_tab_pill.sh, which detects per-tab centers in a screenshot of the iOS 26 floating tab pill, uses Python and Pillow to do the image work. Pure-bash centroid detection would be a substantial reinvention for marginal benefit, and while Apple removed the system Python in macOS 12.3, the Xcode Command Line Tools supply python3, and any machine that can run this skill has Xcode. The README lists Pillow as an optional dependency, since missing Pillow only blocks tab-pill calibration and not any other verify operation. The cleanest version of the skill’s pitch is “no Python,” and the skill does not honor that pitch. In practice, the aspiration manifests as “Python where it pays for itself, shell elsewhere.”
Third, the cognitive-debt argument applies to AXe and to xcbeautify as much as it applies to Conor’s skill, which I have already addressed. I am taking the maintenance-loop risk on both, and the principle does not absolve the skill of the risk; it constrains the choice of which dependencies sit on the call side and which sit on the modify side.
Fourth, the skill has been validated against Claude Code (CLI) running Claude Opus 4.7. The shell scripts are harness-agnostic, but the skill’s use leans on agent judgment in places that have only been exercised on this configuration: reading SKILL.md after set_value.sh exit 6 and applying the documented Form-in-NavigationStack workaround; running the agent-led colloquy without it derailing into ambiguous “your proposed answers are good” replies; recognizing when to use MAIN_TABS_COORDS versus editing the shared data/coordinates.json. Behavior on untested configurations (Sonnet, Haiku, non-Anthropic models, IDE-embedded agents, MCP-driven setups) may vary from “works fine” to “subtly wrong in ways that look like skill bugs but are actually agent-judgment shortfalls.” I have chosen to surface this scope-of-validation framing in the README and in SKILL.md prominently, rather than imply universal applicability the skill has not earned. Reports from other configurations are welcome.
Closing
The scope of ios-build-verify is precise. It does not ship the iOS app; it does not decide what the app should do; it does not even decide what to verify. It puts the question “did the change I just made produce the behavior I expected” into the agent’s reach, so that the human reviewing the work does not have to be the only feedback loop. That is the floor it puts under higher-quality human-in-the-loop. As a wise man once observed, self-verification is the floor that makes higher-quality human-in-the-loop possible. Self-verification is not the abolition of human-in-the-loop.
The reader-facing recommendation is the one Build, Don’t Adopt argued for. Try ios-build-verify on a SwiftUI app of your own. Then, if the cognitive-debt math leans that way for you, build your own. The skill is small, the operations are scripts, the SKILL.md prose is shorter than this post; the case for keeping the skill or for replacing it is, after a week of organic use, one a reader can make.
Credits
Cameron Cooke for AXe, the verify half’s foundation. Charles Pisciotta for xcbeautify, the build half’s. Conor Luddy for ios-simulator-skill and his two posts, and for closing two of my issues against his skill on the same day I filed them. Anthropic for Best Practices for Claude Code and for the frontend-design skill that is the spine of the iOS Design Agent Skill that preceded this one. Antoine van der Lee for the install-instructions structure I borrowed from his SwiftUI Agent Skill. Lawrence Lomax for the idb framework on whose lower-level libraries AXe builds. Margaret Storey for Cognitive Debt. The validator agents that hardened this skill across eight sessions, and the small army of Claude Code instances that wrote most of the actual scripts.
- AztecCal converts dates from the Gregorian calendar to the Aztec calendar. The conversion is interesting on its own merits. For example, the Aztec calendar is a 260-day ritual cycle interlocked with a 365-day solar year. But the app is, for my purposes, a Petri dish: an iOS app small enough to develop quickly and complex enough to exercise the skill’s surface honestly. ↩
- This is HID-faithful behavior. A real keyboard would not auto-clear a focused field when the user typed; AXe does not pretend otherwise. Faithfulness at the primitive layer is what allows the named-intent layer to compose primitives into operations whose names describe their effects. ↩
- Screenshots remain the right primitive when layout, typography, color, or spacing are under review. The skill captures pixels for visual verification and reads the AXTree for state verification; the two are different surfaces with different failure modes, not redundant ways to verify the same thing. ↩
- Conor’s skill is Python-based and works well for many users. My preference against Python tooling for a daily-driver workflow is a personal one, not a critique of his skill, and the discussion of cognitive debt later in this post is the deeper reason build-vs-adopt was the question I was asking myself. ↩
- The iOS 26 accessibility-tree bug class has many members. Segmented Picker controls enumerate as AXTabGroup with empty children: [], exactly the shape of the Tab Bar’s empty children. SwiftUI controls that are visually segmented but accessibility-treed as single elements with hidden inner structure inherit the same FBSimulatorControl-layer bug; both require coordinate-tap or per-point inspection as the workaround. ↩
- I have written elsewhere about this principle of persuasion. Vermont Rule of Professional Conduct 3.3(a)(2) obligates a lawyer to disclose to the tribunal legal authority adverse to the client, and the practice strengthens the argument rather than weakens it. See Two Applications of Life Experiences. ↩