Phil @ ShinCBM

What I'm working on: Building a Workflow

Fri, 26 Jun 2026 00:00:00 +0000

This is a working summary of what I’ve done over the last 6 months to build a workflow for agentic development.

Agentic development will scale up all software development. I see two directions: further along a bad scale (big, clumsy teams that make poor software) or back toward a good scale of smaller, more capable teams. I feel the two paths map onto a split: bad open-loop workflows with no feedback, where the connection to stakeholders and mastery of the technology are lost, versus good closed-loop workflows that converge by building understanding of both.

So to “close the loop” I’m looking for “forcing functions”: constraints applied to guide development toward convergence. These can be informal or formal: a specification in human language, or in a formal modelling language. For software I prefer structured but informal; for systems, a formal system description. The distinction arises because software needs ‘slack’ at its core, whereas at the system level interfaces harden and should be locked down.

My two main areas of interest here are software development and systems modelling.

Software development

My main focus is structured workflows, such as pipelined processes and feedback loops, represented as agent skills.

My Claude Skills Library

nakane1chome/claude-skills

I’ve used agent skills to codify workflows during development, focusing on a few areas:

Technical document creation
Software development lifecycle
A harness for evaluating skills

Technical Document Creation Skills

This is a set of skills that help pull ideas from my brain into text. The key points about these skills are:

Each skill uses a pipeline to refine and edit a technical document, such as a spec or design.
The different skills represent different stages of document progress:
- flesh-out converts raw notes into a working document.
- review-steps checks the working document and does the clean-up.
- strong-edit edits it down to a final document.

I’ve described it in more detail here: Claude Beyond Code: Intent Expression and Agent Skills

The pipeline runs from raw notes through flesh-out, review-steps, and strong-edit to a final document.

Software Development Lifecycle Skill

claude-skills/light-v-structure

I think iterative development via error-correcting feedback loops is the best way to converge on a desired system. For a team, an Agile workflow with a tight code-release-evaluate-iterate cycle is a good harness for convergence, but for agentic development I think the story changes.

The traditional systems engineering V-model can provide the harness for an agent to work autonomously on the low-level implementation.
To enable that a human needs to work at a high level and provide the spec of what the software should do.
This is NOT waterfall! The point is discrete levels of abstraction that are documented or modelled, not discrete stages of development that are gated.

Some V-model development practices are not iterative; stage gating, for example, works against iteration. For agentic development there is an inversion that means stage gating is not needed:

Costs have inverted. Code is now cheaper to produce and verify than specifications.
A specification may be cheaper to validate by running an autonomous implementation loop and using an Agile-style release increment to collect stakeholder feedback.
But to iterate the full loop we need a memory of the project and rails for the iteration. This is now the value of the V-model.
The cost of validation (does it match intent?) is hard to reduce, but the cost of verification (does it match the spec?) can be reduced by scaffolding and tests. That asymmetry, cost(validation) ≫ cost(verification), is why the human owns the top of the V and the agent owns the bottom.

NOTE: Removing stage gating ≠ removing release gating. On the contrary, we can have stronger release gating because we keep accumulated records of what the system does and why.

The idea is a human-dominated top of the V, and an agent-dominated bottom of the V, each with their own iterations driving correctness.

This general idea is similar to Spec-Driven Development (SDD): GitHub Spec Kit, AWS Kiro, OpenSpec, and more. However, the Light-V process is a generalization of a docs-as-code workflow I’ve been using for embedded systems for over 10 years. The main difference is that it is a portable document structure and an agent use discipline, not a tool to adopt. It also pulls in the V-model traceability symmetry: every right-side artifact (test, validation) must verify a left-side one (design, architecture). It targets building new ideas from scratch, growing structure from nothing rather than retrofitting a spec onto an existing codebase.

Dimension	Spec-Driven Development	Light-V
Form	A tool/CLI plus templates and slash commands	A portable doc structure + agent use discipline (no tooling)
Primary artifact	The natural-language spec (including semi-formal like EARS or Gherkin); code via spec-as-source, or spec-anchored code and update	A hierarchy of natural-language docs: code via spec-as-source from design, and spec-anchored for higher layers
Flow	spec → plan → tasks → implement (largely linear)	paradigm → ADR → architecture → design → source → test → validate (iterate)
Verification	“Write testable requirements”	The right-side test verifies its left-side counterpart (design to unit tests, architecture to system tests)
Validation	Implicit - the spec is assumed to capture intent	An explicit human feedback loop at the top of the V: does the result meet the original intent?
Decision capture	The living spec	Architecture Decision Records (ADRs) governed by a paradigm document
Orientation	Often retrofitting onto existing code	Greenfield - building new ideas from scratch
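
The traceability symmetry in the Verification row can be checked mechanically. The following is a minimal sketch, not part of Light-V (which prescribes no tooling): the `Artifact` record, the artifact names, and the `untraced` function are all hypothetical, and it assumes each right-side artifact records the name of the left-side artifact it verifies.

```python
from dataclasses import dataclass
from typing import Optional

# Right-side artifact kind -> the left-side kind it must verify (assumed mapping).
COUNTERPART = {"unit_test": "design", "system_test": "architecture"}


@dataclass(frozen=True)
class Artifact:
    name: str
    kind: str                       # e.g. "design", "unit_test"
    verifies: Optional[str] = None  # name of the left-side artifact it verifies


def untraced(artifacts):
    """Return names of right-side artifacts that do not verify a matching left-side one."""
    by_name = {a.name: a for a in artifacts}
    bad = []
    for a in artifacts:
        if a.kind not in COUNTERPART:
            continue  # left-side artifacts need no upward link in this sketch
        target = by_name.get(a.verifies)
        if target is None or target.kind != COUNTERPART[a.kind]:
            bad.append(a.name)
    return bad


artifacts = [
    Artifact("arch/overview", "architecture"),
    Artifact("design/parser", "design"),
    Artifact("test/parser_unit", "unit_test", verifies="design/parser"),
    Artifact("test/end_to_end", "system_test", verifies="arch/overview"),
    Artifact("test/orphan", "unit_test"),  # verifies nothing, so it is flagged
]
print(untraced(artifacts))
```

A check like this could run as a release gate: an empty result means every test traces back to a document that explains why it exists.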

Testing and Evaluating of Skills

Example test report: generator-coding skill test run
End to end tests for skill : nakane1chome/claude-skills/tests/
Framework for testing skills : nakane1chome/claude-skills/tests_fw/

What I’m trying to do here is automated test & evaluation for Claude skills.

Test the skill - confirm it does what it claims.
Evaluate the skill - measure effectiveness. Add test points that depend on agent capability.
CI Test the skill - ensure each update maintains effectiveness.
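
One way to picture the three activities is as pass rates over repeated runs, compared against a recorded baseline. This is only a sketch under that assumption; it is not the API of the tests_fw framework, and `pass_rates`, `regressions`, the checks, and the sample outputs are hypothetical.

```python
def pass_rates(checks, outputs):
    """Fraction of outputs that satisfy each check.

    checks  -- {check name: predicate taking one output string}
    outputs -- outputs collected from repeated runs of one skill
    """
    return {name: sum(1 for out in outputs if check(out)) / len(outputs)
            for name, check in checks.items()}


def regressions(rates, baseline, tolerance=0.05):
    """Checks whose pass rate dropped below the recorded baseline."""
    return sorted(name for name, rate in rates.items()
                  if rate < baseline.get(name, 0.0) - tolerance)


# Hypothetical checks for a code-generating skill.
checks = {
    # Test point: the skill does what it claims.
    "has_generated_banner": lambda out: out.startswith("/* GENERATED"),
    # Evaluation point: depends on agent capability.
    "no_todo_left": lambda out: "TODO" not in out,
}
outputs = [
    "/* GENERATED */ int a;",
    "/* GENERATED */ int b; // TODO",
    "/* GENERATED */ int c;",
    "/* GENERATED */ int d;",
]
baseline = {"has_generated_banner": 1.0, "no_todo_left": 1.0}  # assumed earlier rates

rates = pass_rates(checks, outputs)
print(rates)
print(regressions(rates, baseline))
```

In CI, a non-empty regression list would fail the build for that skill update.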

Generator (Meta-Coding) Skill

claude-skills/generator-coding

Agents are capable enough to solve meta-problems, that is, code that generates code:

They often create scripts to make bulk changes instead of burning tokens on doing the work themselves.
If there is a common data model shared between many source files, correctness can be improved by using a code generator, rather than having an agent code each unit individually.
The skill encodes the pattern (Data Model → Parser → Helpers → Template → Output), and the rule: never edit generated output.
- One of the key points here is having the agent create a data model and a set of templates.
- The data model is a formal model that can act as a forcing function across the system.
- Code templates enhance code quality by reducing the variance in an agent’s output when it repeats the same activity.
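
The pattern above (Data Model → Parser → Helpers → Template → Output) can be sketched in a few lines. This is an illustration only, not code from the skill: the register data model, the `DEV` prefix, and the function names are assumptions made for the example.

```python
import json
from string import Template

# Data model: a single source of truth (hypothetical register list).
MODEL_JSON = """
{"registers": [
  {"name": "CTRL",   "offset": 0},
  {"name": "STATUS", "offset": 4}
]}
"""

# Template: fixed text, so every generated unit has the same shape.
LINE = Template("#define ${prefix}_${name}_OFFSET ${offset}")


def parse(text):
    """Parser: load and validate the data model."""
    model = json.loads(text)
    for reg in model["registers"]:
        if reg["offset"] % 4 != 0:
            raise ValueError(f"unaligned register {reg['name']}")
    return model


def hex_offset(value):
    """Helper: format values the way the template expects."""
    return f"0x{value:02X}u"


def generate(model, prefix="DEV"):
    """Output: render the template once per model entry."""
    lines = ["/* GENERATED FILE - do not edit; change the model or template. */"]
    for reg in model["registers"]:
        lines.append(LINE.substitute(prefix=prefix, name=reg["name"],
                                     offset=hex_offset(reg["offset"])))
    return "\n".join(lines) + "\n"


print(generate(parse(MODEL_JSON)))
```

A fix to the output always goes through the model or the template, which is what makes the "never edit generated output" rule enforceable.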

System Modelling

These are projects focusing on system level development, using system models - rather than pure software.

System Modelling Language Translation and Query Model

This is the first project where I’ve applied the claude-skills/light-v-structure end-to-end and generated all code via an agent (mostly Claude Opus 4.6 and 4.7).

This is a personal project to create a common interface to various System Modelling Languages - like Pandoc translates between document formats, but for machine-readable system models.

Scoped to system topology - what components exist, how they connect, and their properties - not behaviour (state machines, control flow, simulation).
The idea is to create formal system models as a single source of truth across implementation & verification.
These formal models should be editable by agents, humans, or both. Concurrent editing should be possible.
The editing should be traceable via an immutable fact model (an append-only graph of facts, never updated in place; each correction is a new fact referencing the ones it supersedes, preserving full history and provenance)
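
To make the immutable fact model concrete, here is a minimal sketch of an append-only store in which a correction is a new fact that references the ones it supersedes. It is not the project's implementation: the `Fact` fields, the `FactStore` class, and the sample addresses are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Fact:
    fact_id: int
    subject: str
    attribute: str
    value: str
    author: str                       # provenance, e.g. "human" or "agent"
    supersedes: Tuple[int, ...] = ()  # ids of the facts this one corrects


class FactStore:
    """Append-only log: facts are added, never changed or removed."""

    def __init__(self):
        self._facts = []

    def assert_fact(self, subject, attribute, value, author, supersedes=()):
        fact = Fact(len(self._facts), subject, attribute, value, author, tuple(supersedes))
        self._facts.append(fact)
        return fact.fact_id

    def current(self):
        """Facts not superseded by any other fact."""
        dead = {i for f in self._facts for i in f.supersedes}
        return [f for f in self._facts if f.fact_id not in dead]

    def history(self, fact_id):
        """Walk the supersedes links back from one fact."""
        chain, todo = [], [fact_id]
        while todo:
            fact = self._facts[todo.pop()]
            chain.append(fact)
            todo.extend(fact.supersedes)
        return chain


store = FactStore()
old = store.assert_fact("uart0", "base_address", "0x40000000", author="agent")
new = store.assert_fact("uart0", "base_address", "0x40001000", author="human",
                        supersedes=[old])
print([f.value for f in store.current()])
print([f.author for f in store.history(new)])
```

Because nothing is overwritten, the current view and the full edit history are both queries over the same log.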

Initial Targets focus on embedded systems:

Register Description: CMSIS-SVD (System View Description), SystemRDL (Register Description Language)
Hardware Modelling: VHDL (VHSIC Hardware Description Language), structural subset
System Modelling: AADL (Architecture Analysis & Design Language)
General Modelling: SysMLv2 (Systems Modeling Language v2)

Implementation:

Architecture and design via: claude-skills/light-v-structure
All code & tests autonomously agent written, a combination of C++ and Python.
Testing grounded against human-owned use cases and third-party reference tools (existing parsers, round-trip conversion).
Model storage via an in-memory immutable fact store (looking for a better solution here)
Fact type system inspired by OMG Meta Object Facility
Model query interface based on datalog
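
For readers unfamiliar with Datalog, the sketch below shows the style of query it enables over topology facts: a recursive rule evaluated bottom-up until nothing new can be derived. The `connects` facts and the `reachable` rule are hypothetical examples, and this naive evaluator is not the project's query engine.

```python
# Facts: connects(source, destination) edges of a hypothetical topology.
connects = {("cpu", "bus"), ("bus", "uart0"), ("bus", "timer0")}


def reachable(edges):
    """Naive bottom-up evaluation of the rules:

        reachable(X, Y) :- connects(X, Y).
        reachable(X, Z) :- connects(X, Y), reachable(Y, Z).

    The rules are re-applied until no new tuples appear (a fixpoint).
    """
    result = set(edges)
    while True:
        derived = {(x, z) for (x, y) in edges for (y2, z) in result if y == y2}
        if derived <= result:
            return result
        result |= derived


# Query: which components can the cpu reach?
print(sorted(y for (x, y) in reachable(connects) if x == "cpu"))
```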

What worked and what did not:

First-order implementation worked: the agent implemented the parsers, generators, and their tests completely and correctly. Correctness was determined using third-party parsers to verify syntax, or round-trip conversion to verify equivalence.
Most target languages have no C++ parsers or libraries - the domain is dominated by Java. Some pushing was needed to get the agent to realize that implementing a new parser from open-source grammars was low-cost; otherwise it would create incomplete implementations, or claim features were impossible because no C++ version was available.
Much pushing was needed to get the agent to use consistent testing and code generation according to common guidelines.
Second-order implementation was less successful: the agent could not successfully generalize the implementations to architectural concepts. It preferred spaghetti to architecture. The architecture uses an explicit set of abstractions across languages (the MOF layers M1, M2, M3); when asked to design classes to represent those generalizations, the agent created concrete, inflexible designs that overfit the use cases. I needed to give detailed instructions at this level - but the agent could still produce all the code.
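
The round-trip check mentioned above can be sketched with a toy format. Everything here is an assumption for illustration (the one-line-per-register syntax, `parse`, `generate`, `round_trips`); the real project compares models for languages such as CMSIS-SVD and SystemRDL.

```python
def parse(text):
    """Toy parser: 'NAME OFFSET' per line -> dict. Comments and blank lines are ignored."""
    model = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            name, offset = line.split()
            model[name] = int(offset, 0)  # base 0 accepts both decimal and 0x.. forms
    return model


def generate(model):
    """Toy generator: canonical text form, sorted by offset."""
    ordered = sorted(model.items(), key=lambda item: item[1])
    return "".join(f"{name} 0x{offset:X}\n" for name, offset in ordered)


def round_trips(text):
    """Equivalence is judged on the model, not the text, so formatting may differ."""
    first = parse(text)
    return parse(generate(first)) == first


source = """
# registers
STATUS 4
CTRL   0x0
"""
print(round_trips(source))
```

Comparing models rather than text is the design choice that matters: a generator is free to reorder or reformat as long as no information is lost.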

Summary in Spec-Driven Development (SDD) terms - using the three levels of specification rigour:

The implementation in the first-order (that is from design docs) was successfully completed with the Spec-as-Source model of development - the spec generates the code, the human never edits it.
The implementation in the second-order (that is from architecture docs) was only completed with the Spec-Anchored model of development, although the agent was responsible for code editing.

Time Model for Distributed Simulation

This is a personal project being worked on as an exemplar for the systems-need-formal-models workflow: time is a first-order dimension that software can’t absorb, so a simulator that must be correct in the time domain needs a formal model, not just libraries.

The key idea is to create a formally specified logical model of time in distributed simulation.

Logical model of how time, as a first-order system property, is communicated within a system constrained by real time and bandwidth.
Models real, virtual, and sampled time domains; nodes and channels on a timestamped network; using parameters and constraints to derive synchronisation.
Core idea: all time is independent, but across the system it can be constrained by a model aware of the system’s topology.
Plan to model using SysMLv2 (or other language? TBD)

I expect an ordinary simulator (a program that computes an approximation of a real-world system) will become easier to write with agentic development. But writing a simulator with well-defined behaviour in the time domain and full observability will stay hard without a formal model to constrain it. Unlike most software, a simulator must deal with system attributes such as time, and with the physical and logical layers of a system. For example, time is not well managed by software as software consumes time in its own operation, so it cannot separate the concern of time from compute.

Logical Model of Knowledge Understanding

Dimensions of Understanding

This is an informal model authored entirely by agents. It is the start of an attempt to build a logical architecture for understanding the knowledge structure common to humans and AI. The premise is that a model of understanding can exist independent of any implementation. If it can, several things follow: we could define interfaces to understanding, test and characterize implementations against them, interface to other systems below the text layer (via formal models), and partition models along those interfaces.

Claude Beyond Code: Intent Expression and Agent Skills

Sun, 15 Feb 2026 00:00:00 +0000

I feel like productive developers have often resisted documentation: it interrupts the flow of getting code out of your head and into the editor. Interestingly, my recent experience of agentic coding is flipping this. When an agent writes the code, the bottleneck moves upstream to: “how clearly can I express the idea?” Documentation isn’t the interruption anymore, it’s leverage over the agent.

This post introduces a set of skills for agentic document elicitation and review. The skills are reusable instruction files for Claude Code, covering document creation and review. First, I’d like to present why I’ve created them.

The leverage shift

The traditional developer’s fulcrum has been mental clarity: keep the system idea in your head, design and refine in place. The lever moves well-formed code from brain to editor at speed.

In that model documentation was an interruption, at best a necessary evil. Interrupting that flow to “document it” was disruptive. You did it because collaboration required it, not because it helped you think.

But there is a trap here. A productive developer who tries to slot an agent into their existing mental-clarity workflow may actually lose productivity. New leverage doesn’t come from keeping the idea in your head and using the agent as a faster typist. It comes from changing the workflow. This is where documentation comes back in: it is a way to share the idea directly.

In the new model, the agentic developer’s fulcrum is agentic clarity: externalize the idea so completely that the lever moves code from agent to codebase without pulling a human back into the loop.

Here, documentation can be an enabler, or at least that’s the argument I’ll make. When an agent handles the code, human language becomes the input, not the afterthought.

Plan mode in agentic coding tools is one path to agentic clarity, but plans are one-off. A living document in the workspace gives both agents and humans a shared, versioned reference. Iterative prompting works for contained tasks, but anything that persists (an API consumed by multiple agents, a design revisited across sprints) benefits from a document under configuration management rather than a conversation that evaporates. The development lifecycle sees the same ideas surface repeatedly.

A Structured Review Skill

Firstly, this approach relies on docs as code: maintaining documentation as an artifact alongside source code. You can port the skills to a different workflow, but in my opinion that reduces leverage and adds friction.

At every stage of the development lifecycle, ask: what is the source of information? Is it the developer’s novel idea, or can the agent infer it from standard approaches? With three skills I form a pipeline to capture the relevant information and filter the noise:

Skill	Action	Information	Noise	Refinement
flesh-out	Generate	Expand	Adds (controlled)	Raw ore → shaped material
review-steps	Polish	Preserve	Reduce	Remove impurities
strong-edit	Critique	Enhance	Reduce	Stress-test the structure

A typical pipeline runs flesh-out → review-steps → strong-edit, but the skills can be applied in any order; each one detects what the document needs. All three are broken into numbered stages with developer checkpoints. The agent stops after each stage and waits for approval before proceeding.

graph LR
    raw[Raw notes] --> FO

    subgraph FO ["/flesh-out"]
        fo0(Stage 0<br/>Extract ideas) --> fo1(Stage 1<br/>Research)
        fo1 --> fo2(Stage 2<br/>Structure)
        fo2 --> fo3(Stage 3<br/>Polish)
    end

    FO --> draft[Structured draft]
    draft --> RS

    subgraph RS ["/review-steps"]
        rs0(Stage 0<br/>Language) --> rs1(Stage 1<br/>Clarity)
        rs1 --> rs2(Stage 2<br/>Structure)
        rs2 --> rs3(Stage 3<br/>Consistency)
        rs3 --> rs4(Stage 4<br/>Best practice)
        rs4 --> rs5(Stage 5<br/>Tidy up)
        rs5 --> rs6(Stage 6<br/>Verify links)
    end

    RS --> reviewed[Reviewed draft]
    reviewed --> SE

    subgraph SE ["/strong-edit"]
        se0(Stage 0<br/>Core argument) --> se1(Stage 1<br/>Structure)
        se1 --> se2(Stage 2<br/>Relevance)
        se2 --> se3(Stage 3<br/>Challenge)
        se3 --> se4(Stage 4<br/>Readability)
        se4 --> se5(Stage 5<br/>Edits)
    end

    SE --> final[Final document]

    style raw fill:#f9f,stroke:#333
    style final fill:#9f9,stroke:#333

Writing and editing in stages

What I’m trying to do is find the perfect division of labor. The human brings the problem and the engineering; the agent brings industry techniques, frameworks, and best practices. That allows you to focus your time on what is not in the model’s training data and leverage the agent for everything else.

To do that, on top of breaking editing into three skills, each has a series of stages.

flesh-out: from raw notes to structured content: Takes a skeleton of ideas (bullet points, stream-of-consciousness notes, half-formed thoughts) and expands them into a structured document. The agent first confirms it understands the developer’s intent, then researches, structures, polishes, and tidies up. The risk here is meaning distortion. Raw notes are ambiguous, and assumptions compound. Stage 0 (extract core ideas, developer confirms) exists specifically to catch misunderstandings before they propagate.
review-steps: polish and verify: Takes a structured draft and improves it: language consistency, conceptual clarity, structural compliance, comparison against industry best practice, and link verification. The agent handles mechanical checks and research; the developer holds final authority on judgment calls. Stages 0-4 move progressively from minor editing to conceptual validation. Stage 5 tidies up references; Stage 6 verifies every URL resolves and that agent-sourced references support the claims made.
strong-edit: challenge the content: Takes a complete draft and critiques it: structure, relevance, argument strength, readability. Stages 0-4 are critique only; no edits are made to the document. The agent identifies weaknesses; the developer decides what matters. Edits happen in Stage 5 after the critique is agreed.
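
The checkpoint behaviour shared by all three skills (run a stage, stop, wait for approval) can be sketched as a small control loop. This is an illustration of the idea only; the skills themselves are instruction files, and `run_stages`, the stand-in stage functions, and the `approve` callback are hypothetical.

```python
def run_stages(document, stages, approve):
    """Run stages in order; stop at the first one the developer does not approve.

    stages  -- list of (name, function) pairs; each function returns a proposed document
    approve -- callback standing in for the developer checkpoint
    """
    for name, stage in stages:
        proposal = stage(document)
        if not approve(name, document, proposal):
            return document, name  # keep the last approved version
        document = proposal
    return document, None


# Hypothetical stand-ins for real stages.
stages = [
    ("Stage 0: Language", lambda d: d.replace("teh", "the")),
    ("Stage 1: Clarity", lambda d: d + " (clarified)"),
    ("Stage 2: Structure", lambda d: d.upper()),
]


def approve(name, before, after):
    """A developer who rejects the structure change and accepts the rest."""
    return not name.startswith("Stage 2")


result, stopped_at = run_stages("teh draft", stages, approve)
print(result, "|", stopped_at)
```

The point of the shape is that a rejected stage never alters the document, so the human stays the gate between every step.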

Try it out

Clone https://github.com/nakane1chome/claude-skills
Install to your home dir: run ./install.sh and follow the prompts
Already have a completed document? Try /review-steps or /strong-edit
Starting from scratch? Write a quick and dirty memo and try /flesh-out
The collection includes other skills; try /review-skill on one of your Claude Code skill files to review its structure and coverage

Caveats & Risks

It’s not perfect yet; some of the issues with the flow are:

False sense of completeness: A large language model (LLM) will confidently proclaim your work to be exceptional and ready to publish. The strong-edit skill exists partly to counter this; its philosophy section puts it directly: “Challenge the author’s assumptions. If something seems unclear, it may be unclear, or wrong. Don’t fix; question.”
Hallucinations and trust: The staged approach constrains this risk but does not eliminate it (see the afterword for a concrete example from this very post). At every step, the source of information (human idea or agent inference) should be identifiable. The staged process keeps the human engaged rather than rubber-stamping, but link and fact verification remain the developer’s job. Developer trust in AI accuracy reportedly remains low (only 43% in recent surveys); for me the key is checkpoints and grounding, which staged editing provides.
Sensitivity to generated documents: Some people will reject anything they believe was written by AI, regardless of how it was actually produced. These skills make the human the source of novel information and the agent an editor, but that distinction may not matter to everyone.
Speed: It’s deliberately supervised, which can be slow.

Final Words

Raw ideas are ungrounded and cause confusion. Ground the idea first. Use the agent to find and refine concrete ideas into written language before starting work.

Afterword: Observations from practice

This post

It started as 727 words of raw notes with partial structure. /flesh-out expanded that to 2279 words, tripling the length. The structure held but the agent added padding and hedging.

Human editing cut about 12% (2279 to 2007 words), removing agent-generated filler. /review-steps added the METR productivity trap reference, the developer trust data, and the afterword observations, bringing the word count back up to 2257.

/strong-edit sharpened the opening to lead with the thesis, tightened the leverage shift sentences, and replaced false binary framing. Final word count: ~1930.

Then a link turned up 404. During /review-steps, the agent had added a reference with a fabricated URL. The domain didn’t exist. The link survived human editing and /strong-edit because at every stage the reviewer was evaluating structure, argument, and voice — nobody was clicking links. Stage 6 (verify links) was added to the review-steps skill as a direct result of this failure.
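
The mechanical half of a link check is easy to automate. The sketch below is an assumption about how such a check could look, not the contents of the review-steps skill: `URL_PATTERN`, `resolves`, and `broken_links` are hypothetical names, and the example uses a stand-in checker so it runs offline.

```python
import re
import urllib.error
import urllib.request

URL_PATTERN = re.compile(r"https?://[^\s)>\]]+")


def resolves(url, timeout=10):
    """True if the URL answers a HEAD request with a non-error HTTP status."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        return False


def broken_links(text, check=resolves):
    """Return every URL in the text for which the check fails."""
    urls = sorted({match.rstrip(".,;") for match in URL_PATTERN.findall(text)})
    return [url for url in urls if not check(url)]


# Offline demonstration with a stand-in checker; omit `check` to use real HTTP.
known_good = {"https://github.com/nakane1chome/claude-skills"}
text = ("See https://github.com/nakane1chome/claude-skills, "
        "and (https://no-such-domain.invalid/report).")
print(broken_links(text, check=lambda url: url in known_good))
```

A check like this only shows that a URL resolves. Whether the page supports the claim it is cited for still needs a reader, and some servers reject HEAD requests, so a failure is a prompt to look, not proof of a fabricated link.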

Rough attribution of the final post:

	Human	Agent	Co-created
Ideas and arguments	~95%	~5%	—
Research and references	~50%	~50%	—
Prose (the actual sentences)	~5%	~60%	~35%
Editorial decisions	100%	0%	—

In general

Applied mostly to design and architecture documents, working with Claude Code Opus 4.5 and skills-as-files:

Scaffolding precision: Claude sticks precisely to the staged workflow when actions are defined as skills. This was not the case when just adding project guidelines to CLAUDE.md or AGENTS.md; in both cases the agent did not consistently apply the scaffold.
Mode detection: The agent identifies when a review is not appropriate — e.g., when a largely complete piece would benefit more from strong-edit than review-steps.
Term sensitivity: “Review” signals preservation; “flesh-out” signals generation; “strong-edit” signals critique. The difference in vocabulary shapes the agent’s behavior significantly.

Standout steps

Review vs industry best practice

Stage 4 of review-steps sends the agent to search for relevant frameworks and approaches in the domain, then compares the document against what it finds. Not a substitute for expert review; it’s an automated search with a structured summary. But you won’t be caught off-guard by an obvious gap that a five-minute search would have revealed.

This stage tends to pull in a lot of information. I ask the agent to generate a separate report and keep it out of the document under review. Without that constraint, the agent folds unnecessary details from the research into the document.

Stage 0: Extract Idea (flesh-out) and Core Argument (strong-edit)

As a rough guide, the agent gets around 70% of these stages correct and then asks questions to clarify the remainder. It produces a concise summary of what it understands and what needs clarification, and this is very insightful to interact with.

Flesh-out catching a wrong assumption

On a separate project, I wrote 169 words of raw notes for a CPU model design. Key assumption: “Only the kernel uses the ISA directly in supervisor space.”

/flesh-out expanded to a 1962-word design document (11.6x). On a second pass, the agent researched the actual source tree and found 21+ user-space assembly files across 10 library modules and 3 applications. The assumption was wrong: the ISA boundary was a language boundary, not a privilege boundary.

This changed the design. The context diagram was rewritten from a two-path flow to a 2x2 matrix (kernel/user x assembler/pascal). A fundamental misunderstanding was caught before it shaped the implementation.

Future Extension

The review process produces a specification with a known complexity profile. That profile could select the agent: a high-end model for architectural work, a lightweight model for boilerplate. Model selection as a downstream output of document review, not a separate decision.

Review reports

OpenCL Learning Exercise — Image Transform

Sun, 11 Jan 2026 00:00:00 +0000

This is an old tool written to play with OpenCL (Open Computing Language) on x86_64 and aarch64 machines.

https://github.com/nakane1chome/opencl-learn/tree/master

ImageXform is a command-line tool for learning and experimenting with OpenCL by applying GPU-accelerated kernels to images. It provides a visual, hands-on way to understand OpenCL programming by transforming PNG or JPG images through custom or example kernels.

The tool handles all the OpenCL and image conversion boilerplate (device selection, context creation, buffer management) so the kernel can be launched and results inspected.

It is designed for learning OpenCL concepts like parallel processing, memory management, and kernel optimization through immediate visual feedback.

Code and build system have been slightly modernized after fixing a few bugs.

Example Usage

Transform an image to grayscale:

# Build the project (one-time setup)
make build-amd64

# Run grayscale conversion
cd build.amd64/imgxform
./imagexform -p grayscale.cl -i test_in.png -o output_gray.png

# The tool will:
# 1. Load test_in.png
# 2. Compile grayscale.cl kernel
# 3. Execute it on your OpenCL device (GPU/CPU)
# 4. Save the result to output_gray.png

You can substitute any .cl kernel file to experiment with different transformations like edge detection (edge_3x3.cl) or create your own custom kernels.
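
For readers new to the kernel model, the sketch below imitates it in plain Python: the kernel body handles exactly one pixel, and the launcher calls it once per pixel index. It is not the contents of grayscale.cl; the Rec. 601 luma weights, the RGBA tuple layout, and the names `grayscale_kernel` and `launch` are assumptions for illustration.

```python
def grayscale_kernel(gid, src, dst):
    """Body of one work-item: reads pixel gid, writes pixel gid, touches nothing else."""
    r, g, b, a = src[gid]
    y = round(0.299 * r + 0.587 * g + 0.114 * b)  # assumed Rec. 601 luma weights
    dst[gid] = (y, y, y, a)


def launch(kernel, src):
    """Stand-in for an OpenCL launch: one kernel call per pixel index.

    On a device these calls run in parallel; here they run in a loop.
    """
    dst = [None] * len(src)
    for gid in range(len(src)):
        kernel(gid, src, dst)
    return dst


pixels = [(255, 255, 255, 255), (255, 0, 0, 255), (0, 0, 0, 128)]  # RGBA
print(launch(grayscale_kernel, pixels))
```

Because each call depends only on its own index, the order of execution does not matter, which is what lets a GPU run the same body across all pixels at once.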