I built Iris because I kept shipping agents with no way to know if the outputs were actually good. Infrastructure monitoring sees 200 OK and moves on — it has no idea the agent just leaked PII or burned $40 on a single query.
Iris sits at the MCP protocol layer and evaluates every agent output against 12 deterministic rules. No LLM-as-judge — regex, thresholds, keyword checks that fire or don't. Independently rated AAA by Glama.
I built Iris because I kept shipping agents with no way to know if the outputs were actually good. Infrastructure monitoring sees 200 OK and moves on — it has no idea the agent just leaked PII or burned $40 on a single query.
Iris sits at the MCP protocol layer and evaluates every agent output against 12 deterministic rules. No LLM-as-judge — regex, thresholds, keyword checks that fire or don't. Independently rated AAA by Glama.
One command to try it:
npx @iris-eval/mcp-server