How to Build a Production
In July 2025, a Claude Code recursion loop burned between 16,000 USD and 50,000 USD in five hours. There was no crash or error, just agents doing exactly what they were told, indefinitely, because nobody told them when to stop.
Four months later, a four-agent LangChain loop ran for eleven days and cost 47,000 USD. Nobody noticed until the invoice arrived. The pipeline worked correctly in testing, and the agents were doing exactly what they were told. Same pattern.
This tutorial is about that missing instruction.
You'll build five small Python primitives that catch most agent loop failures before they ship:
A spec writer that forces you to define done before the loop starts
A circuit breaker that kills the loop when it exceeds hard limits
A ledger that records every turn in an append-only SQLite audit trail
An agent loop that ties all three together
A review surface that forces human attestation before downstream systems receive anything
By the end you'll have a working repo you can drop into any agent project. The full code is at github.com/dannwaneri/production-safe-agent-loop.
Why This Keeps Happening
Prerequisites
Phase 1: Define Done Before You Build
Phase 2: Enforce Done at Runtime
Phase 3: Record Everything
Phase 4: The Loop That Respects Its Boundaries
Phase 5: The Review Surface
Phase 6: A Real Example, SEO Audit Agent
Pluggable LLM Client
Running the Tests
What You've Built
Next Steps
Why This Keeps Happening
The math that got companies into trouble was simple. A chatbot costs roughly 0.04 USD per interaction. An orchestrated multi-agent workflow costs 1.20 USD. That's a 30x multiplier — and production benchmarks show it can reach 70x on complex tasks.
The problem isn't that agents are expensive. The problem is that most teams budgeted for chatbot costs and deployed agent architectures. Gartner found the token consumption gap between pilot chatbots and production agent workflows sits at 5-30x. The FinOps Foundation's 2026 State of FinOps report found 73% of enterprises say AI costs exceeded original projections.
The mechanism is straightforward once you see it. When an agent fails a task and retries, it doesn't start fresh. It re-reads the entire context window — every prior failed attempt — before trying again. Iteration one costs 100 tokens. Iteration two costs 200. Iteration ten costs thousands. You're paying for every failure, over and over, in milliseconds.
```python
# This is the entire problem in three lines
while True:
    result = agent.run(task)
    # done when...?
```
That question mark is where the money goes.
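The retry arithmetic above can be made concrete with a short sketch. It assumes the simplest possible growth model: every turn re-sends the whole history, and each turn adds a fixed 100 tokens of new context. The function names and the 100-token figure are illustrative, not measurements, and real agents usually grow faster because tool output lands in the history too.

```python
def turn_cost(turn: int, tokens_per_turn: int = 100) -> int:
    """Input tokens billed on a given turn when the full history is re-sent."""
    return turn * tokens_per_turn


def cumulative_cost(turns: int, tokens_per_turn: int = 100) -> int:
    """Total tokens billed across all turns up to and including `turns`."""
    return sum(turn_cost(t, tokens_per_turn) for t in range(1, turns + 1))


for n in (1, 2, 10, 50):
    print(f"turn {n:>2}: {turn_cost(n):>5} tokens this turn, "
          f"{cumulative_cost(n):>7} billed so far")
```

Even under this gentle assumption the per-turn cost grows linearly while the bill grows quadratically: ten turns total 5,500 tokens and fifty turns total 127,500.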
The other thing making it worse: agents don't fail loudly. Traditional code hits an undefined state and crashes. An LLM hits ambiguity and tries to be helpful. It retries. It reformats the tool call. It spins up a verification agent. The verification agent finds something. A correction agent fires. Nobody defined what "correct" means. The loop looks beautiful on every dashboard you have — activity, tool calls, completion rate — while quietly burning through your budget.
Gartner predicts that 40% of agentic projects will be scrapped by 2027 due to economic failure. Most of that failure is preventable. Not with better models, but with exit conditions.
Prerequisites
Python 3.10+
An Anthropic API key (or any provider — more on that later)
Basic familiarity with Python classes and SQLite
```bash
git clone https://github.com/dannwaneri/production-safe-agent-loop
cd production-safe-agent-loop
pip install -r requirements.txt
export ANTHROPIC_API_KEY=sk-...
```
Phase 1: Define Done Before You Build
The most expensive mistake in agent development isn't a bad model choice or a missing retry limit. It's starting the build before you can answer one question in one sentence:
What does done look like?
Most teams can't answer it. Not because they're careless, but because nothing forces them to before they open the terminal. The spec writer is that forcing function.
```python
# spec_writer.py
from spec_writer import SpecWriter

spec = SpecWriter(db_path="spec.db").run()
```
When you call .run(), it won't return until you've answered three questions:
What does this do?
What does this NOT do?
What does done look like in one sentence?
The third question is the one that matters. It's also the hardest. "The agent audits the site" is not an answer. "The agent crawls the target URL, extracts all <title> and <meta description> tags, flags any that are missing or over-length, and stops" is an answer. Only the second gives the circuit breaker something to enforce.
The spec stores to SQLite and returns a SpecResult dataclass with a session_id. That ID becomes the thread connecting your spec, your ledger rows, and your loop result. One session, traceable end to end.
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecResult:
    what_it_does: str
    what_it_does_not: str
    done_looks_like: str
    session_id: str
```
frozen=True matters. The spec is a commitment, not a draft. Once it's written, the loop runs against it. No mid-run revisions.
For testing, SpecWriter accepts injectable input_fn and output_fn callables. No stdin monkey-patching required. See tests/test_spec_writer.py for working examples: the suite uses a small scripted_input helper that returns answers from a generator, and writes to a per-test SQLite file via pytest's tmp_path fixture. SQLite's :memory: isn't safe here, because SpecWriter opens a fresh connection per method and each :memory: connection is its own isolated database.
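The :memory: caveat is easy to check for yourself. The sketch below uses only the standard library and a throwaway table named spec (the table and file names are placeholders, not the repo's schema): a second in-memory connection cannot see what the first one created, while a second connection to the same file can.

```python
import os
import sqlite3
import tempfile

# Two separate ":memory:" connections are two separate, empty databases.
first = sqlite3.connect(":memory:")
first.execute("CREATE TABLE spec (session_id TEXT)")
second = sqlite3.connect(":memory:")
try:
    second.execute("SELECT * FROM spec")
    shared_in_memory = True
except sqlite3.OperationalError:
    # The second connection never saw the table the first one created.
    shared_in_memory = False
first.close()
second.close()

# Two connections to the same file path see the same data.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "spec.db")
    writer = sqlite3.connect(path)
    writer.execute("CREATE TABLE spec (session_id TEXT)")
    writer.execute("INSERT INTO spec VALUES ('abc')")
    writer.commit()
    writer.close()

    reader = sqlite3.connect(path)
    rows_from_file = reader.execute("SELECT session_id FROM spec").fetchall()
    reader.close()

print(shared_in_memory, rows_from_file)
```

That is why a per-test file from tmp_path is the safer choice whenever the code under test opens more than one connection.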
Phase 2: Enforce Done at Runtime
Defining the exit condition upstream is discipline. The circuit breaker is enforcement.
```python
# circuit_breaker.py
from circuit_breaker import CircuitBreaker, CircuitBreakerError

breaker = CircuitBreaker(turn_limit=5, token_limit=15000)
breaker.check(turn_count, accumulated_tokens)  # raises on breach
```
Two ceilings. Both hard.
turn_limit caps how many times the loop can call the LLM. token_limit caps total token consumption across all turns. Either one tripping raises CircuitBreakerError immediately.
The boundary is strict: turn_count == turn_limit is allowed. turn_count == turn_limit + 1 trips. No grace periods or warnings. A hard stop forces a human checkpoint.
```python
from dataclasses import dataclass


@dataclass
class CircuitBreakerError(Exception):
    reason: str  # "turn_ceiling" or "token_ceiling"
    turn_count: int
    accumulated_tokens: int

    def __post_init__(self) -> None:
        super().__init__(
            f"circuit breaker tripped: {self.reason} "
            f"(turn={self.turn_count}, tokens={self.accumulated_tokens})"
        )


class CircuitBreaker:
    def __init__(self, turn_limit: int = 5, token_limit: int = 15000) -> None:
        self.turn_limit = turn_limit
        self.token_limit = token_limit

    def check(self, turn_count: int, accumulated_tokens: int) -> None:
        if turn_count > self.turn_limit:
            self._trip("turn_ceiling", turn_count, accumulated_tokens)
        if accumulated_tokens > self.token_limit:
            self._trip("token_ceiling", turn_count, accumulated_tokens)

    def _trip(self, reason: str, turn_count: int, accumulated_tokens: int) -> None:
        print(
            "\n=== CIRCUIT BREAKER CHECKPOINT ===\n"
            f"reason      : {reason}\n"
            f"turn_count  : {turn_count} / limit {self.turn_limit}\n"
            f"tokens_used : {accumulated_tokens} / limit {self.token_limit}\n"
            "action      : halt loop, surface to human reviewer\n"
            "=================================="
        )
        raise CircuitBreakerError(
            reason=reason,
            turn_count=turn_count,
            accumulated_tokens=accumulated_tokens,
        )
```
CircuitBreakerError is an exception, not a return code. That's intentional. A return code can be ignored. An uncaught exception can't. Silent breach is impossible. The human-readable checkpoint banner is printed to stdout by _trip() before the exception is raised, so even if a caller swallows the exception the operator still sees state.
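A quick way to convince yourself of the boundary rule is to probe both sides of each limit. The sketch below uses a condensed stand-in for the breaker (no banner, a plain exception) so it runs on its own. The names TinyBreaker, BreakerTripped, and outcome are made up for this example and are not in the repo; only the two comparisons are copied from the listing above.

```python
class BreakerTripped(Exception):
    """Stand-in for CircuitBreakerError; carries the breach reason."""

    def __init__(self, reason: str) -> None:
        super().__init__(reason)
        self.reason = reason


class TinyBreaker:
    """Condensed stand-in for CircuitBreaker with the same comparisons."""

    def __init__(self, turn_limit: int = 5, token_limit: int = 15000) -> None:
        self.turn_limit = turn_limit
        self.token_limit = token_limit

    def check(self, turn_count: int, accumulated_tokens: int) -> None:
        if turn_count > self.turn_limit:
            raise BreakerTripped("turn_ceiling")
        if accumulated_tokens > self.token_limit:
            raise BreakerTripped("token_ceiling")


def outcome(breaker: TinyBreaker, turn_count: int, tokens: int) -> str:
    """Return 'ok' or the breach reason for a single check."""
    try:
        breaker.check(turn_count, tokens)
    except BreakerTripped as err:
        return err.reason
    return "ok"


breaker = TinyBreaker(turn_limit=5, token_limit=15000)
print(outcome(breaker, 5, 15000))  # exactly at both limits
print(outcome(breaker, 6, 0))      # one turn over
print(outcome(breaker, 1, 15001))  # one token over
print(outcome(breaker, 6, 15001))  # both over: the turn check runs first
```

Because the turn comparison comes first, a run that breaches both ceilings at once is reported as a turn breach. If you care which reason lands in the audit trail, that ordering is worth knowing.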
The critical rule: call .check() before every LLM call, not after. Post-flight checking means you've already burned the tokens before you knew the limit was exceeded.
```python
# Wrong: post-flight
result = client.messages.create(...)
breaker.check(turn_count, accumulated_tokens)  # too late

# Right: pre-flight
breaker.check(turn_count, accumulated_tokens)  # raises before any spend
result = client.messages.create(...)
```
The defaults (5 turns, 15,000 tokens) match a tight tutorial demo. Your production budget is different. Tune at instantiation:
```python
# Production example: larger token budget, more turns
breaker = CircuitBreaker(turn_limit=10, token_limit=50000)
```
Phase 3: Record Everything
The circuit breaker protects your bank account. The ledger protects your understanding of what happened.
Most teams log for debugging — they want to know what went wrong after it went wrong. The ledger has a different purpose. It's governance. Every row is proof that the loop stayed within its boundaries, or didn't, and exactly when.
```python
# ledger.py
from ledger import Ledger

ledger = Ledger(db_path="ledger.db")
ledger.write(
    session_id=spec.session_id,
    turn_count=1,
    state_origin="llm",
    input_str=task,
    token_delta=523,
    execution_time_ms=1240,
    pass_fail=True,
)
```
One row per turn. Append-only, no updates, and no deletes. The immutability is the point: a ledger you can edit isn't a ledger, it's a notebook.
The schema:
```sql
CREATE TABLE IF NOT EXISTS ledger (
    id                INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id        TEXT    NOT NULL,
    turn_count        INTEGER NOT NULL,
    state_origin      TEXT    NOT NULL,
    input_hash        TEXT    NOT NULL,
    token_delta       INTEGER NOT NULL,
    execution_time_ms INTEGER NOT NULL,
    pass_fail         INTEGER NOT NULL,  -- 1=pass, 0=fail
    breach_reason     TEXT,              -- NULL unless circuit breaker fired
    created_at        TEXT    NOT NULL   -- ISO 8601, UTC
);
CREATE INDEX IF NOT EXISTS idx_ledger_session ON ledger(session_id);
```
The index turns get_session(session_id), the primary read path, into a B-tree search instead of a full table scan, so lookups stay fast as the ledger grows.
Three decisions worth explaining:
input_hash not input_text. The raw input string never persists. Only its SHA-256 hash does. There are two benefits to this: identical inputs across runs are detectable, and raw PII never enters the audit trail.
pass_fail as INTEGER not BOOLEAN. SQLite has no boolean type. 1 and 0 are canonical. Clean Python ergonomics at the API edge, correct SQL types on disk.
created_at as datetime.now(timezone.utc).isoformat(). datetime.utcnow() was deprecated in Python 3.12. Timezone-aware timestamps avoid the footgun in any system that crosses timezones.
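Those three decisions fit in a few lines of standard-library Python. The sketch below is not the repo's Ledger class: it uses a trimmed-down table and a hypothetical write_row helper, just enough to show the hash, the integer coercion, the UTC timestamp, and how SQLite plans the session lookup once the index exists.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")  # a single connection, so in-memory is fine
conn.executescript(
    """
    CREATE TABLE ledger (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        session_id TEXT    NOT NULL,
        input_hash TEXT    NOT NULL,
        pass_fail  INTEGER NOT NULL,
        created_at TEXT    NOT NULL
    );
    CREATE INDEX idx_ledger_session ON ledger(session_id);
    """
)


def write_row(session_id: str, input_str: str, pass_fail: bool) -> None:
    """Hash the input, coerce the bool, stamp UTC time, insert one row."""
    input_hash = hashlib.sha256(input_str.encode("utf-8")).hexdigest()
    created_at = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO ledger (session_id, input_hash, pass_fail, created_at) "
        "VALUES (?, ?, ?, ?)",
        (session_id, input_hash, int(pass_fail), created_at),
    )


write_row("s1", "audit https://example.com", True)
write_row("s1", "audit https://example.com", False)

rows = conn.execute(
    "SELECT input_hash, pass_fail, created_at FROM ledger WHERE session_id = ?",
    ("s1",),
).fetchall()

# Ask SQLite how it would run the session lookup.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM ledger WHERE session_id = ?", ("s1",)
).fetchall()
plan_text = " ".join(str(step[-1]) for step in plan)

print(rows)
print(plan_text)
```

Both rows carry the same 64-character hash because the input was identical, and the query plan names idx_ledger_session rather than scanning the table.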
Retrieve by session:
```python
rows = ledger.get_session(spec.session_id)
for row in rows:
    print(f"Turn {row.turn_count}: {'PASS' if row.pass_fail else 'FAIL'} "
          f"| {row.token_delta} tokens | {row.execution_time_ms}ms")
```
Phase 4: The Loop That Respects Its Boundaries
The agent loop wires the three primitives together. It's the only component that calls the LLM. Everything else is local.
```python
# agent_loop.py
from agent_loop import AgentLoop

loop = AgentLoop(spec, breaker, ledger, client)
result = loop.run(task)
# LoopResult(success, turns, total_tokens, session_id, breach_reason)
```
The anatomy of a turn, in order:
circuit_breaker.check(turn_count, accumulated_tokens): raises if either ceiling is exceeded
client.messages.create(...): the actual LLM call
ledger.write(...): one row, append-only
If stop_reason == "end_turn", return. Otherwise loop.
Pre-flight checking before every LLM call, with no exceptions.
```python
def run(self, task: str) -> LoopResult:
    session_id = self.spec.session_id
    messages: list[dict] = [{"role": "user", "content": task}]
    turn = 0
    total_tokens = 0
    try:
        while True:
            turn += 1
            self.circuit_breaker.check(turn, total_tokens)

            started = time.perf_counter()
            response = self.client.messages.create(
                model=self.model,
                max_tokens=self.max_tokens,
                system=self._system_prompt(),
                messages=messages,
            )
            elapsed_ms = int((time.perf_counter() - started) * 1000)

            turn_tokens = (
                getattr(response.usage, "input_tokens", 0)
                + getattr(response.usage, "output_tokens", 0)
            )
            total_tokens += turn_tokens

            text = self._text_from(response)
            messages.append({"role": "assistant", "content": text})

            self.ledger.write(
                session_id=session_id,
                turn_count=turn,
                state_origin="llm",
                input_str=task,
                token_delta=turn_tokens,
                execution_time_ms=elapsed_ms,
                pass_fail=True,
            )

            if getattr(response, "stop_reason", "end_turn") == "end_turn":
                return LoopResult(
                    success=True,
                    turns=turn,
                    total_tokens=total_tokens,
                    session_id=session_id,
                )

            messages.append({"role": "user", "content": "continue"})
    except CircuitBreakerError as err:
        self.ledger.write(
            session_id=session_id,
            turn_count=turn,
            state_origin="circuit_breaker",
            input_str=task,
            token_delta=0,
            execution_time_ms=0,
            pass_fail=False,
            breach_reason=err.reason,
        )
        return LoopResult(
            success=False,
            turns=turn,
            total_tokens=total_tokens,
            session_id=session_id,
            breach_reason=err.reason,
        )


def _system_prompt(self) -> str:
    return (
        "You are an agent working on a tightly-scoped task.\n\n"
        f"What this does: {self.spec.what_it_does}\n"
        f"What this does NOT do: {self.spec.what_it_does_not}\n"
        f"Done looks like: {self.spec.done_looks_like}\n"
    )


@staticmethod
def _text_from(response) -> str:
    content = getattr(response, "content", None)
    if not content:
        return ""
    block = content[0]
    return getattr(block, "text", "") or ""
```
A few choices worth calling out in this body:
The whole while True: is wrapped in one try/except CircuitBreakerError. The check happens at the top of every turn, so a breach is caught the same way whether it fires on turn 1 or turn 6.
input_str=task on every ledger row: the original task, not the last assistant message. The input_hash column then groups rows that share the same starting input across the run.
pass_fail=True for every LLM turn that returns, False only on breach. The pass/fail flag tracks whether the loop reached the row legitimately, not whether the model's output was good. Quality scoring is a separate concern.
_system_prompt() uses all three spec fields, not just done_looks_like. The model needs the negative scope (what_it_does_not) at least as much as the positive scope.
time.perf_counter() not time.time(): monotonic, immune to wall-clock adjustments mid-run.
LoopResult.session_id is inherited from spec.session_id. The ledger rows tie back to the spec without a join. One session ID, one traceable run, start to finish.
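To see the pre-flight rule doing its job, you can point a condensed version of the loop at a model that never finishes. Everything in the sketch below is a stand-in: NeverDoneClient and run_bounded are hypothetical names, the limits are inlined instead of using the breaker class, and the result is a plain dict instead of LoopResult. The ordering inside the loop is the part that mirrors the listing above.

```python
from types import SimpleNamespace


class NeverDoneClient:
    """Hypothetical test double: every reply claims there is more to do."""

    def __init__(self) -> None:
        self.calls = 0
        self.messages = self  # so client.messages.create(...) resolves here

    def create(self, **kwargs) -> SimpleNamespace:
        self.calls += 1
        return SimpleNamespace(
            stop_reason="max_tokens",  # never "end_turn"
            usage=SimpleNamespace(input_tokens=100, output_tokens=50),
        )


def run_bounded(client, turn_limit: int, token_limit: int) -> dict:
    """Condensed loop: check limits first, then spend, then account."""
    turn = 0
    total_tokens = 0
    while True:
        turn += 1
        # Pre-flight: refuse the call before any tokens are spent.
        if turn > turn_limit:
            return {"success": False, "reason": "turn_ceiling",
                    "turns": turn, "total_tokens": total_tokens}
        if total_tokens > token_limit:
            return {"success": False, "reason": "token_ceiling",
                    "turns": turn, "total_tokens": total_tokens}
        response = client.messages.create(messages=[])
        total_tokens += response.usage.input_tokens + response.usage.output_tokens
        if response.stop_reason == "end_turn":
            return {"success": True, "reason": None,
                    "turns": turn, "total_tokens": total_tokens}


client = NeverDoneClient()
result = run_bounded(client, turn_limit=5, token_limit=15000)
print(result, "LLM calls made:", client.calls)
```

With a limit of five, the client is called exactly five times and the sixth call never happens. Note that turns comes back as 6, the turn on which the breach was detected, which is one more than the number of calls actually made. The run() method above reports turns the same way.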
Phase 5: The Review Surface
The circuit breaker protects your bank account. The ledger records what happened. But neither tells you whether what happened matched what you promised.
That gap is where bad loops get approved. Polished output, green dashboard, missed commitment. A reviewer sees the artifact, decides it looks acceptable, and signs off. Nobody asked whether the original promise was kept.
The review surface closes that gap. It reads the session from SQLite, assembles the five-element frame, and forces a comparison before anything downstream receives the output.
```python
from review_surface import ReviewSurface

rs = ReviewSurface(spec_db_path="spec.db", ledger_db_path="ledger.db")
print(rs.render(session_id))
```
Here's the five-element frame, in order:
Original promise: pulled from the spec table (what it does, what it doesn't do, what done looks like)
Acceptance criteria: the done_looks_like field rendered as the explicit benchmark
Diff: first turn input vs final turn output, turns completed, total tokens, whether the loop breached
Evidence: all ledger rows for the session, with turn-by-turn pass/fail, token delta, and execution time
Unresolved assumptions: derived from breach rows and failed turns. Empty when clean.
When the reviewer is satisfied, they attest:
```python
attestation = rs.attest(
    session_id=result.session_id,
    reviewer="daniel",
    notes="Output matches spec. Approved.",
)
print(attestation.frame_hash)
```
.attest() writes to the attestations table in ledger.db. The frame_hash is a SHA-256 of the canonical frame data, deterministic across reviewers attesting the same session. It's the audit receipt. It proves the reviewer saw the exact frame as rendered, not a summary or a paraphrase.
Approval confirms the process ran. Attestation confirms the reviewer compared output to commitment. When the loop touches something regulated, those are different legal documents.
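The determinism claim for frame_hash depends on hashing a canonical encoding rather than whatever order the data happens to arrive in. Here is one way to get that property, assuming the frame can be flattened to JSON-friendly values. The repo's actual canonicalisation may differ, and the frame_hash function and sample frames below are illustrative only.

```python
import hashlib
import json


def frame_hash(frame: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the frame data."""
    canonical = json.dumps(
        frame,
        sort_keys=True,         # key order cannot change the digest
        separators=(",", ":"),  # no incidental whitespace
        ensure_ascii=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


frame_a = {"session_id": "s1", "turns": 3, "breached": False}
frame_b = {"breached": False, "turns": 3, "session_id": "s1"}  # same data, reordered
frame_c = {"session_id": "s1", "turns": 4, "breached": False}  # one field differs

print(frame_hash(frame_a) == frame_hash(frame_b))
print(frame_hash(frame_a) == frame_hash(frame_c))
```

For the hash to match across reviewers, anything reviewer-specific (name, notes, attestation time) has to stay out of the hashed data and live alongside it in the attestation row instead.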
```python
@dataclass(frozen=True)
class ReviewFrame:
    session_id: str
    original_promise: SpecResult
    acceptance_criteria: str
    diff: DiffResult
    evidence: tuple  # tuple[LedgerRow, ...]
    unresolved_assumptions: tuple  # tuple[str, ...]
    created_at: str
```
ReviewFrame is frozen for the same reason SpecResult is: the frame is evidence, not a draft. evidence and unresolved_assumptions are tuples because lists aren't hashable, and a frozen dataclass instance can only be hashed if every one of its fields is hashable.
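The tuple-versus-list point is quick to verify. In this sketch WithTuple and WithList are throwaway classes, not part of the repo. Both definitions are accepted by Python, and the difference only shows up when you try to hash an instance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WithTuple:
    evidence: tuple


@dataclass(frozen=True)
class WithList:
    evidence: list


# A frozen dataclass hashes its fields, so a tuple of strings works.
hashable = hash(WithTuple(evidence=("row1", "row2")))

try:
    hash(WithList(evidence=["row1", "row2"]))
    list_version_hashable = True
except TypeError:
    # The generated __hash__ fails on the list field.
    list_version_hashable = False

print(isinstance(hashable, int), list_version_hashable)
```

So the constraint is not enforced when the class is defined. It surfaces later, the first time the frame is hashed, put in a set, or used as a dict key.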
The full end-to-end flow with the review surface lives in examples/review_example.py in the repo. Run it after any completed session: it renders the five-element frame, prompts for attestation, and writes the receipt if you approve.
The loop runs to you. Downstream systems get nothing until someone signs.
Phase 6: A Real Example — SEO Audit Agent
The pattern only makes sense against a real problem. This is the same agent architecture behind my seo-agent project.
SEO audits have a natural cadence: crawl, surface what's broken, fix, wait for reindex. Running the agent continuously doesn't change that cadence. It just burns tokens in the empty space between the moments that matter. A cron job wired to the loop is the honest architecture.
```python
# examples/seo_audit_example.py
import requests
from bs4 import BeautifulSoup
import anthropic

from spec_writer import SpecWriter
from circuit_breaker import CircuitBreaker
from ledger import Ledger
from agent_loop import AgentLoop


def crawl_url(url: str) -> str:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # don't audit an error page as if it were the site
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.find("title")
    meta_desc = soup.find("meta", attrs={"name": "description"})
    h1_tags = soup.find_all("h1")
    # .get() avoids a KeyError when the meta tag has no content attribute
    description = meta_desc.get("content", "MISSING") if meta_desc else "MISSING"
    return (
        f"URL: {url}\n"
        f"Title: {title.text if title else 'MISSING'}\n"
        f"Meta description: {description}\n"
        f"H1 count: {len(h1_tags)}\n"
        f"H1 tags: {[h.text[:50] for h in h1_tags]}"
    )


def run_seo_audit(url: str) -> None:
    # Step 1: Define done before the loop starts
    spec = SpecWriter(db_path="spec.db").run()

    # Step 2: Initialise circuit breaker and ledger
    breaker = CircuitBreaker(turn_limit=5, token_limit=15000)
    ledger = Ledger(db_path="ledger.db")
    client = anthropic.Anthropic()

    # Step 3: Crawl the URL
    site_data = crawl_url(url)

    # Step 4: Run the loop
    # AgentLoop catches CircuitBreakerError internally and returns
    # LoopResult(success=False, breach_reason=...). Branch on the
    # result; do NOT wrap loop.run() in try/except CircuitBreakerError.
    loop = AgentLoop(spec, breaker, ledger, client)
    result = loop.run(f"Audit this page for SEO issues:\n\n{site_data}")

    # Step 5: Print the ledger
    print(f"\nResult: {'SUCCESS' if result.success else 'BREACH'}")
    if not result.success:
        print(f"Breach reason: {result.breach_reason}")
    print(f"Turns: {result.turns} | Tokens: {result.total_tokens}")
    print("\nAudit trail:")
    for row in ledger.get_session(result.session_id):
        status = "PASS" if row.pass_fail else "FAIL"
        print(f"  Turn {row.turn_count}: {status} | "
              f"{row.token_delta} tokens | {row.execution_time_ms}ms")


if __name__ == "__main__":
    import sys

    run_seo_audit(sys.argv[1] if len(sys.argv) > 1 else "https://example.com")
```
Run it:
```bash
python examples/seo_audit_example.py https://yourdomain.com
```
The spec writer prompts you. The loop runs, the circuit breaker fires if the limits are exceeded, and the ledger records every turn. The output lands in front of you and you decide what to fix.
The loop runs to you, not into a void.
Pluggable LLM Client
The loop works with any client that satisfies the LLMClient protocol (Anthropic by default). Bring your own via a ~20-line adapter.
```python
# agent_loop.py
from typing import Protocol, runtime_checkable


@runtime_checkable
class MessagesEndpoint(Protocol):
    def create(self, *, model: str, max_tokens: int,
               system: str, messages: list) -> object: ...


@runtime_checkable
class LLMClient(Protocol):
    messages: MessagesEndpoint
```
messages is an instance attribute (not a nested class) because that's how the real Anthropic SDK exposes it: anthropic.Anthropic().messages.create(...). Modeling it as a nested class would mean the real client wouldn't satisfy the Protocol. The @runtime_checkable decorator lets you sanity-check conformance with isinstance(client, LLMClient), and the repo's test suite uses exactly that assertion against the FakeClient test double.
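The conformance check is easy to try with a test double of your own. The sketch below repeats the two protocols so it runs standalone and adds a hypothetical ScriptedClient. That is an illustration of the idea, not the repo's FakeClient, whose details may differ.

```python
from types import SimpleNamespace
from typing import Protocol, runtime_checkable


@runtime_checkable
class MessagesEndpoint(Protocol):
    def create(self, *, model: str, max_tokens: int,
               system: str, messages: list) -> object: ...


@runtime_checkable
class LLMClient(Protocol):
    messages: MessagesEndpoint


class ScriptedClient:
    """Hypothetical test double that replays canned replies in order."""

    def __init__(self, replies: list[str]) -> None:
        self._replies = iter(replies)
        self.messages = self  # instance attribute, like the real SDK

    def create(self, *, model, max_tokens, system, messages):
        return SimpleNamespace(
            content=[SimpleNamespace(text=next(self._replies))],
            stop_reason="end_turn",
            usage=SimpleNamespace(input_tokens=10, output_tokens=5),
        )


client = ScriptedClient(["done"])
conforms = isinstance(client, LLMClient)
missing_attr = isinstance(object(), LLMClient)
reply = client.messages.create(model="m", max_tokens=16, system="", messages=[])
print(conforms, missing_attr, reply.content[0].text)
```

One caveat: a runtime isinstance check against a Protocol only confirms that the named members exist. It does not verify the signature of create or the type of messages, so treat it as a sanity check and let a static type checker do the stricter work.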
Here's an illustrative OpenAI adapter. A production adapter would also map streaming, tool-use, and error shapes.
```python
# openai_adapter.py: illustrative pseudocode, not production-ready.
from openai import OpenAI as _OpenAI


class _MessagesAdapter:
    def __init__(self, client):
        self._client = client

    def create(self, *, model, max_tokens, system, messages):
        completion = self._client.chat.completions.create(
            model=model,
            max_tokens=max_tokens,
            messages=[{"role": "system", "content": system}] + messages,
        )
        # Reshape OpenAI's response into the Anthropic-shaped surface
        # AgentLoop reads: response.usage.{input,output}_tokens,
        # response.content[0].text, response.stop_reason.
        # _adapt_response is a helper you supply; it is not defined in this listing.
        return _adapt_response(completion)


class OpenAIAdapter:
    def __init__(self, api_key: str):
        self._client = _OpenAI(api_key=api_key)
        self.messages = _MessagesAdapter(self._client)  # instance attr, not a nested class
```
The adapter pattern is worth teaching explicitly. Provider APIs don't share a shape. Anthropic puts system at the top level. OpenAI puts it inside the messages array. An adapter shim is ~20 lines and makes the loop provider-agnostic without rewriting anything. Note that self.messages is assigned in __init__ so it's a real attribute on each adapter instance, the same shape as the actual SDK.
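The adapter leans on an _adapt_response helper to do the reshaping. A minimal sketch of what that helper might look like follows. The stop-reason mapping is an assumption made for this example, and the fake completion is built by hand so the block runs without the OpenAI SDK or a network call.

```python
from types import SimpleNamespace

# Assumed mapping from OpenAI finish reasons to the Anthropic-style
# stop reasons the loop inspects. Extend it for your own needs.
_STOP_REASONS = {"stop": "end_turn", "length": "max_tokens"}


def _adapt_response(completion) -> SimpleNamespace:
    """Reshape a chat-completion object into the surface AgentLoop reads."""
    choice = completion.choices[0]
    return SimpleNamespace(
        content=[SimpleNamespace(text=choice.message.content or "")],
        stop_reason=_STOP_REASONS.get(choice.finish_reason, choice.finish_reason),
        usage=SimpleNamespace(
            input_tokens=completion.usage.prompt_tokens,
            output_tokens=completion.usage.completion_tokens,
        ),
    )


# Stand-in for an OpenAI completion, built by hand so the sketch runs offline.
fake_completion = SimpleNamespace(
    choices=[SimpleNamespace(
        message=SimpleNamespace(content="Title tag is missing."),
        finish_reason="stop",
    )],
    usage=SimpleNamespace(prompt_tokens=120, completion_tokens=30),
)

adapted = _adapt_response(fake_completion)
print(adapted.stop_reason, adapted.content[0].text,
      adapted.usage.input_tokens + adapted.usage.output_tokens)
```

The stop-reason mapping matters more than it looks. The loop only returns on "end_turn", so if "stop" were passed through unmapped, a finished response would look unfinished and the loop would keep going until the breaker tripped.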
Running the Tests
```bash
python -m pytest tests/
```
With coverage:
```bash
python -m coverage run --source=circuit_breaker,ledger,spec_writer,agent_loop,review_surface -m pytest tests/
python -m coverage report -m
```
80 tests, 100% coverage on all five core modules. The loop is exercised against a FakeClient test double defined inline in tests/test_agent_loop.py. It satisfies the LLMClient protocol via duck typing: messages is set to self, so client.messages.create(...) routes back to the same object, which ships with scripted responses for each test scenario. Clone the repo and run pytest to see all 80 tests pass without touching the network or needing an API key.
circuit_breaker.py has 100% coverage, with no untested paths. It's the financial safety component. Every path through it is exercised.
What You've Built
In this tutorial, you've built five small primitives, each independently usable.
| Module | Role | Lines |
|---|---|---|
| spec_writer.py | Forces three answers before the loop runs | 104 |
| circuit_breaker.py | Hard ceilings on turns and tokens | 41 |
| ledger.py | Append-only SQLite audit trail | 113 |
| agent_loop.py | The loop that respects both | 128 |
| review_surface.py | Assembles the five-element frame, records human attestation | 114 |
The pattern: upstream discipline defines the boundaries. Downstream enforcement breaks the circuit. Neither trusts the model to police itself.
A loop that runs without an exit condition isn't autonomous. It's a billing event waiting to happen.
Define what done looks like before you start. That's the job, and always has been.
Next Steps
The repo is at github.com/dannwaneri/production-safe-agent-loop.
There are three natural extensions if you want to go further:
1. Graduation to Distributed Systems
The SQLite ledger works for isolated sequential loops. The moment you run multiple agents against shared state, you need serializable isolation — concurrent writes to flat JSON corrupt silently. The README documents the three tipping points where a flat ledger needs to graduate.
2. Cryptographic Signing
For compliance-scale systems where the auditor wasn't present when the loop ran, SQLite rows aren't enough. A database admin can run an UPDATE query. Ed25519 signing wraps each ledger row in a receipt that proves the log wasn't altered after execution. But that's a different tutorial.
3. Wiring a Cron Job
The honest architecture for the SEO audit agent isn't 24/7 autonomous operation. It's a cron job that runs on schedule, surfaces what's broken, and stops. 0 3 * * 2 python examples/seo_audit_example.py https://yourdomain.com is the whole thing. The loop runs to you, not into a void.
If you need this architecture built for your own stack (circuit breakers, audit trails, production-safe agent loops), I do freelance work. dannwaneri.com/ai-agents/