Eval-Driven Development

Summary

Build a real eval suite from zero with Evalite, deterministic checks, LLM-as-a-judge, and the data flywheel that turns user feedback into a system that improves itself. The unit tests for probabilistic software.

Evals Are the Unit Tests of AI

The Demo-to-Production Gap

Setting Up Evalite

Deterministic Evals

LLM-as-a-Judge

Human Evaluation

Building a Representative Dataset

Scoring Strategies

Local, CI, and Daily Runs

The Data Flywheel

Evals as a Regression Net

Eval-Driven Development in Practice

Building Your Eval Suite

About Roger

Keep reading

AI Project Management: Shipping Software When Agents Write the Code

Something strange happened to my job this year. The typing got cheap. An agent can turn a paragraph of intent into a working pull request while I refill my coffee, and yet the projects did not get faster. Not really. More code shipped, more PRs merged, and somehow the same features took roughly the same number of weeks to actually land. That gap is what this guide is about. When machines write the code, the scarce thing is no longer the writing. It's the specifying, the slicing-up, and the reviewing. Project management stopped being about handing out work and started being about feeding agents instructions clear enough that their output is worth keeping, then catching the parts that aren't. I've been running this way for a while now: specs before code, work cut into agent-sized pieces, a small fleet on separate worktrees, and a review loop that assumes the machine got something subtly wrong. This is the playbook I wish someone had handed me. It covers spec-driven development, decomposition, acceptance criteria an agent can test against, parallelism without chaos, the review loop, governance for code you didn't write, and the handful of metrics that tell you whether any of it is actually working. It's written for engineers and tech leads who already use agents to write code and want to run real projects with them, not demos. _This is a living document and will be updated as the tools and the workflows keep evolving._

Mastering Hermes

Most of this series treats Hermes as the place its ideas land. Building Your Agentic OS pivots to it, Running the Fleet orchestrates on it, Self-Hosting the Agentic Stack deploys it. What none of them do is sit down and cover Hermes itself, the whole thing, every feature, the way I'd walk a competent developer through a tool they've never run in anger. That's this guide. Hermes is Nous Research's open-source, self-hosted agent, the one with a built-in learning loop: it writes its own skills from experience, curates its own memory, searches its own past conversations, and builds a deepening model of who you are across sessions. It installs with one command, runs on a five-dollar box or a GPU cluster, talks to whatever model you point it at, and you can message it from Telegram while it works on a cloud VM. That surface is a lot bigger than the rest of the series has needed to show. So here we go wide instead of deep: install and first contact, the model layer and Nous Portal, the terminal interface, the messaging gateway across six platforms, the six places it can run, the learning loop, context files and personality, tools and toolsets, MCP, scheduled automations, delegation and subagents, the security model, day-two operations, and migrating in from OpenClaw. Where a topic has its own field guide in this series, I point you there instead of repeating it. This is the manual that ties the rest together. _This is a living document and will be updated as Hermes updates._

The Agentic Playbook

Everyone agrees you should be building agents. Nobody agrees on how. One camp says drag nodes on a canvas and ship this afternoon. The other says real agents live in code, in your repo, behind your own API. They're both right, and the argument is a distraction: the loop is the same either way. A model, a set of tools, a memory, and a stopping condition. Once you see that, the question stops being "which side is right" and becomes "which lane fits this job." This playbook walks both lanes properly. The first half builds agents in n8n: the AI Agent node, tools it can actually call (including MCP servers), memory and RAG on the canvas, human approval gates, multi-agent patterns, and the Evaluations feature that tells you whether any of it works. The second half builds the same ideas in TypeScript with the Vercel AI SDK inside an Astro site: a streaming chat endpoint, real tool definitions with schemas, the ToolLoopAgent, approval gates in code, structured output, and MCP as the bridge that lets your n8n workflows and your code agents share the same tools. It pairs with [Mastering n8n](/guides/mastering-n8n) (which covers hosting and hardening the platform itself) and [Roll Your Own Coding Agent](/guides/roll-your-own-coding-agent) (which builds the raw loop from nothing), and once you're shipping agents, [QA in the Era of AI](/guides/qa-in-the-era-of-ai) shows what happens when you point them at your test suite. This one is about shipping: picking a lane, building the agent, and knowing when to switch lanes as the job outgrows the canvas. _This is a living document and will be updated as the tools and patterns evolve._
