Principled scaffolding for reliable, deterministic AI agents.
A structured framework for designing agentic workflows that are deterministic, verifiable, and scalable. Move beyond prompt engineering — engineer the process.
The framework was created to address a fundamental limitation of today’s AI agents: non-determinism and lack of verifiability. Here is the problem we solve and where this approach sits among existing methods.
Powerful LLMs have enabled autonomous agents (e.g. SWE-agent, Devin) that promise to automate complex tasks. The dominant design puts the LLM at the center of the cognitive loop—entrusting it with both task execution and high-level planning. That creates a core weakness: LLMs are probabilistic and opaque, so agent behavior is inherently non-deterministic. The same input can succeed today and fail tomorrow. This blocks adoption in mission-critical settings where we need reliability and verifiability—for example, modifying a production codebase.
We shift control out of the LLM and into a structured, machine-readable workflow. The agent’s “brain” is a deterministic engine that runs a predefined plan based on explicit, file-system state. The LLM becomes a controlled tool—a Skill—invoked by that engine for specific, well-defined tasks. The framework defines a standard Workflow Unit: five components (Workflow.md, phases/, skills/, definitions/, templates/) and a Deterministic Decision Tree that branches only on file/directory existence. It also supports workflow composition: master workflows orchestrate sub-workflows with serial execution, gates, and iterative loops. The result is a process that combines LLM capability with the reliability of classical software engineering.
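A minimal sketch of such a deterministic engine, assuming a hypothetical `next_phase` helper and the `docs/codearch/` layout described later in this document (the `modules/` subdirectory name is illustrative). Every branch tests only explicit file-system state, never LLM output:

```python
from pathlib import Path

def next_phase(root):
    """Deterministic decision tree: each branch tests explicit
    file-system state, so the same state always selects the same phase."""
    docs = Path(root) / "docs" / "codearch"
    if not (docs / "overall_report.md").exists():
        return "phases/01-overview.md"   # Q1: overall report missing
    if not (docs / "modules").is_dir():
        return "phases/02-analysis.md"   # Q2: module reports missing
    return None                          # all required artifacts exist: done
```

Re-running the engine on the same directory tree always yields the same phase, which is what makes the workflow resumable and idempotent.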
The chart below places Agentic Workflow relative to other approaches. The horizontal axis is determinism (low → high); the vertical axis is scope (single-task → end-to-end). We aim for high determinism and broad scope—reliable, verifiable, end-to-end workflows—which is where Agentic Workflow (and similar designs) sit, distinct from LLM-driven agents (broad scope, lower determinism) and traditional static analysis (high determinism, narrower scope).
A standard workflow is a directory with five kinds of components that together define a complete, executable process. The same structure also supports workflow composition: multiple self-contained sub-workflows can be orchestrated by a master workflow to run in sequence, with gates and iterative loops.
- Workflow.md (entry point) — The brain of the unit. It holds a deterministic decision tree that branches only on file/directory existence (explicit state), not on LLM output. It decides which phases run and when to move on.
- phases/ (execution steps) — An ordered sequence of Markdown files (e.g. 01-overview.md, 02-analysis.md). Each phase defines a sub-goal and orchestrates the skills needed to achieve it.
- skills/ (atomic capabilities) — One Markdown file per atomic task, often one LLM call. A skill is stateless: it takes context, applies logic (e.g. a detailed prompt), and produces an output. Skills are the "how" that phases invoke.
- definitions/ (domain knowledge) — Stable, reusable knowledge (coding standards, review patterns, output schemas, validation rules). Skills reference these so that knowledge lives outside prompts and is easy to update.
- templates/ (output formats) — Structures for what skills produce (e.g. report templates). They keep outputs consistent and machine-readable.

Together: Workflow orchestrates phases → phases invoke skills → skills use definitions and templates to produce structured output.
When a phase runs, it delegates to one or more skills. Each skill reads the current context, pulls in the relevant definitions and templates, and writes structured output. The flow is: Phase → Skill → (Definitions + Templates) → structured output.
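A sketch of one skill invocation in Python. The file names (`review_patterns.md`, the template path) and the `run_skill` signature are hypothetical; the point is the shape: a stateless function that reads context plus shared definitions, makes one well-scoped LLM call, and writes structured output to an explicit location:

```python
import json
from pathlib import Path

def run_skill(name, context, definitions_dir, template_path, out_path, llm):
    """Stateless skill: pull in definitions and the output template,
    make one well-scoped LLM call, write structured output to disk."""
    rules = Path(definitions_dir, "review_patterns.md").read_text()
    template = Path(template_path).read_text()
    prompt = f"{rules}\n\n{template}\n\nContext:\n{context}"
    result = llm(prompt)                          # the only probabilistic step
    Path(out_path).write_text(json.dumps({"skill": name, "output": result}))
    return result
```

Because the skill is stateless and its output lands on disk, the decision tree can verify completion by checking for the output file rather than trusting the LLM's self-report.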
Sub-workflows can each be a full workflow unit (with their own Workflow.md, phases/, skills/, etc.). A master workflow treats each sub-workflow as a single step and can run them serially, insert verification gates between them, and loop a sub-workflow until its gate passes.
definitions/ can be inherited by all sub-workflows, which may add their own local definitions. For a detailed walkthrough with real file layouts, see the Example section and the Agentic Code Assurance repository.
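The master-workflow pattern can be sketched as a small orchestration loop. `run_sub`, `gate`, and `max_iterations` are hypothetical names; the gate is any deterministic check of the sub-workflow's on-disk artifacts:

```python
def run_master(sub_workflows, run_sub, gate, max_iterations=3):
    """Serial composition: run each sub-workflow to completion, then apply a
    deterministic gate; on gate failure the same sub-workflow is re-run."""
    for sub in sub_workflows:
        for _ in range(max_iterations):
            run_sub(sub)                 # execute the whole sub-workflow unit
            if gate(sub):                # gate inspects artifacts, not LLM text
                break
        else:
            raise RuntimeError(f"gate for {sub!r} still failing "
                               f"after {max_iterations} iterations")
```

The iteration cap turns an otherwise open-ended retry loop into a bounded, reportable failure, which matters for unattended runs.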
Each principle addresses a specific failure mode of naive agentic systems. Together, they form a complete framework for reliable, production-grade workflows.
docs/codearch/ is the contract between Stage 1 and Stage 2; task_list.md is the contract between Stage 2 and Stage 3.
If build_success=false or tests_runnable=false, the entire workflow halts immediately with a clear error report. There is no point in analyzing code that cannot be compiled or tested.
A comprehensive examination of each design principle — the rationale, design significance, and how it manifests in a real workflow.
Workflow.md begins with a quick decision tree (Q1, Q2, Q3...) where each question provides clear Yes/No branches that map directly to specific actions. Each question is paired with an independent "Judgment Basis" section containing exhaustive checklists and executable verification commands.

In 1-code-cognition/Workflow.md, Q1 "Does the overall report exist?" has four explicit checks: (1) docs/codearch/overall_report.md exists; (2) it contains a non-empty "Project Goals" section; (3) it has "Inputs", "Outputs", "Main Flow" sections; (4) it has an "Information Sources" section. If any check fails, Phase 01 is triggered — zero ambiguity.
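The Q1 judgment basis could be mechanized as below. This is a simplified sketch: it assumes `## `-style Markdown headings and, slightly stricter than the checks above, requires a non-empty body under every required section, not just "Project Goals":

```python
import re
from pathlib import Path

REQUIRED_SECTIONS = ["Project Goals", "Inputs", "Outputs",
                     "Main Flow", "Information Sources"]

def overall_report_complete(path):
    """Judgment basis for Q1: the report file exists and every required
    section heading is present with a non-empty body beneath it."""
    p = Path(path)
    if not p.exists():
        return False                                   # check (1)
    text = p.read_text()
    # Map each "## Heading" to the text below it, up to the next heading.
    sections = dict(re.findall(r"^## (.+?)\n(.*?)(?=^## |\Z)",
                               text, re.M | re.S))
    return all(sections.get(s, "").strip() for s in REQUIRED_SECTIONS)
```

A boolean check like this is exactly what lets the decision tree branch with zero ambiguity.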
docs/codearch/ represents Stage 1 completion, docs/risk_tasks/ represents Stage 2 output, and docs/remediation/ represents Stage 3 artifacts. Every key artifact has a corresponding structure definition document specifying required sections, fields, and downstream usage conventions.

task_output_structure.md defines not only the required fields for each task record (location, description, risk type, related module, reasoning chain, excluded protections, preconditions to verify, impact level), but also a "Downstream Usage Conventions" section that explicitly guides Stage 3 on how to consume this information — how to locate code, understand reasoning chains, and design targeted verification tests, ensuring lossless and efficient cross-stage information transfer.
A 100-module project might exceed hundreds of thousands of tokens if all module reports are combined. Under this principle, when analyzing a specific bug, the agent loads only overall_report.md (~2k tokens) and 1–2 relevant module reports (~5k tokens each) — keeping total consumption extremely low at approximately 12k tokens per query.
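The minimum-context principle sketched in Python, assuming the hypothetical layout `docs/codearch/overall_report.md` plus per-module reports under a `modules/` subdirectory:

```python
from pathlib import Path

def load_context(docs_dir, relevant_modules):
    """Minimum-context loading: always include the small overview, then only
    the module reports that the current query actually touches."""
    docs = Path(docs_dir)
    parts = [(docs / "overall_report.md").read_text()]
    for module in relevant_modules:
        parts.append((docs / "modules" / f"{module}.md").read_text())
    return "\n\n".join(parts)
```

Context cost therefore scales with the number of relevant modules, not with the size of the codebase.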
02-review.md (Phase) states the task is "perform deep review." The review method is detailed in skill-02-review.md (Skill) — covering batching strategy, depth determination, and pattern application. "Review patterns" are abstracted into review_patterns.md (Definition), loaded on demand by the Skill. Feedback conventions are defined centrally in the root definitions/feedback_protocol.md, referenced by all stages via link.
When reviewing concurrency race conditions (Pattern C-1: Lock Order Consistency), the agent does not simply search for lock keywords. Instead, it loads "concurrency invariants" from the module report, identifies the agreed lock acquisition order (e.g., lock A before lock B), then searches the code for reverse acquisition paths. This agreement-based check is far more precise than undirected scanning.
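A toy sketch of such an agreement-based check. The `lock(X)`/`unlock(X)` call syntax is a deliberate simplification (real C++ would need an AST, not a regex); the point is that the detector is parameterized by the documented lock order rather than scanning blindly:

```python
import re

def reversed_lock_acquisitions(source, agreed_order):
    """Agreement-based concurrency check: given the module's documented lock
    order, flag places that acquire two locks in the reverse order."""
    rank = {lock: i for i, lock in enumerate(agreed_order)}
    held, violations = [], []
    for m in re.finditer(r"lock\((\w+)\)|unlock\((\w+)\)", source):
        if m.group(1):                               # lock(X)
            for h in held:
                if rank.get(m.group(1), -1) < rank.get(h, -1):
                    violations.append((h, m.group(1)))   # X taken after h: reversed
            held.append(m.group(1))
        elif m.group(2) in held:                     # unlock(X)
            held.remove(m.group(2))
    return violations
```

Feeding the detector the invariant from the module report is what keeps its precision high compared with undirected keyword search.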
For a suspected buffer overflow, the agent writes a test_buffer_overflow test passing an overly long string. If the program crashes, the bug is "confirmed." After the fix (e.g., adding length validation), the verification test passes. A full regression suite run confirms no side effects, and the test is permanently integrated into the official test suite.
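A Python stand-in for that verification test (the real target is a C function; `parse_field` and `max_len` are hypothetical). The test encodes the expectation that over-long input is now rejected cleanly rather than crashing:

```python
def parse_field(buf, max_len=64):
    """Post-fix parser: rejects over-long input instead of overflowing.
    (Stand-in for the patched C function; max_len is illustrative.)"""
    if len(buf) > max_len:
        raise ValueError("input exceeds maximum field length")
    return buf.strip()

def test_buffer_overflow():
    """The confirming test: an overly long string must be rejected
    cleanly, not crash the program."""
    try:
        parse_field("A" * 10_000)
    except ValueError:
        return True      # fix verified: bad input handled gracefully
    return False         # still vulnerable (or silently accepting)
```

Once it passes, the test joins the permanent suite, so the same bug cannot silently regress.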
After the agent's first module decomposition, the structured review discovers that the utils module is overly broad. The review fails, generating a change list: "split utils into string_utils, net_utils, math_utils." The workflow rolls back to Phase 02 to regenerate reports for the three new modules, then reviews again — repeating until the decomposition is sound.
If the build fails (build_success=false) or unit tests cannot run (tests_runnable=false), the entire workflow halts immediately with a clear error report and blocking reason.

During Stage 1 Phase 03, the agent attempts compilation but fails due to a missing dependency. build_and_tests.md marks build_success as false. The decision tree evaluates Q3 as "No," triggering the hard gate. The agent reports the error and halts the workflow, waiting for the user to fix the build environment before re-executing.
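The hard gate reduces to a few lines once the flags live in an explicit artifact. This sketch assumes build_and_tests.md records them as plain `key=value` lines (the storage format is an assumption; the document only names the flags):

```python
from pathlib import Path

def enforce_build_gate(report_path):
    """Hard gate: parse explicit key=value flags from the build report and
    halt the whole workflow if the environment is broken."""
    text = Path(report_path).read_text()
    flags = dict(line.split("=", 1) for line in text.splitlines() if "=" in line)
    if flags.get("build_success") != "true":
        raise SystemExit("HALT: build failed; fix the environment and re-run")
    if flags.get("tests_runnable") != "true":
        raise SystemExit("HALT: unit tests cannot run; see blocking reason")
```

Halting here is deliberate: downstream analysis of uncompilable code would only produce noise.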
During Stage 2 review, the agent discovers that a module's concurrency model description doesn't match the actual code — the report marks it as "single-threaded," but the code uses std::thread. The agent immediately updates the module report's "Concurrency Model" section and records this feedback in the change log, keeping the knowledge base accurate for downstream consumers.
The Agentic Code Assurance workflow instantiates all 9 principles across three sequential, contract-bound stages.
Agentic workflows with principled scaffolding address the fundamental limitations of existing approaches.
| Capability | Traditional Static Analysis (Coverity, Clang-Tidy) | Monolithic LLM Agent (SWE-agent, OpenDevin) | Principled Agentic Workflow (This Framework) |
|---|---|---|---|
| Semantic Understanding | ✗ Pattern-based only | ~ Context-limited | ✓ Deep semantic knowledge base |
| False Positive Rate | High — no intent model | Medium — hallucination risk | Low — knowledge-grounded analysis |
| Deterministic Behavior | ✓ Rule-based | ✗ Non-deterministic | ✓ Decision-tree driven |
| Verifiable Correctness | ~ Alerts only | ✗ No test enforcement | ✓ TDD-based verification |
| Scales to Large Codebases | ✓ File-by-file | ✗ Context window limits | ✓ Minimum context principle |
| Resumable / Idempotent | ✓ | ✗ Stateless per session | ✓ File-system state |
| Detects Logic Bugs | ✗ Syntax/pattern only | ~ Unreliable | ✓ Semantic path tracing |
| Debuggable Process | ~ Report only | ✗ Black box | ✓ Every decision is traceable |
| Accumulates Test Assets | ✗ | ✗ | ✓ Tests integrated permanently |
Explore a real-world implementation of these principles applied to C/C++ code quality and test refactoring.
Adopt these principles to build your own domain-specific agentic workflow in four steps.
Write a Workflow.md with a Q&A decision tree. Each question must be answerable by a shell command or file check — never by the LLM's opinion. Include the exact verification commands in a "Judgment Basis" subsection.

Explore the open-source example, study the workflow structure, and apply these principles to your own domain.