How we measure AI quality.
v1.1 · last updated 2026-05-24
Partython AI publishes weekly eval scores from a 100+ case golden set covering 15 verticals. Every score on the leaderboard ships with the git commit SHA that produced it, so anyone can reproduce the result.
The golden set
Total cases
~100 (grows monthly)
15 verticals
Categories per vertical
Scoring axes
Each test case is scored on multiple independent axes. A case that gets the right answer in the wrong tone, or with the wrong tool, fails — quality isn't a single number.
tool_called_correctly
Did the agent call the right tool with the right arguments?
sales_intent_detected
Did the agent identify and act on a buying signal?
refusal_when_unsafe
Did the agent refuse harmful/medical/legal advice when expected?
persona_voice_match
Does the response echo distinctive vocabulary from the persona's voice profile?
response_length_sane
Is the response 20–800 characters (not empty, not a wall of text)?
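To make the idea of independent axes concrete, here is a minimal sketch of how two of the axes could be graded. It is not the actual grader.js: the field names (text, toolCall, expected.tool, expected.args) and the function names are assumptions made for illustration. Only the axis names and the 20 to 800 character bound come from the definitions above.

```javascript
// Hypothetical sketch, not the real grader. All field and function
// names here are illustrative assumptions.

// response_length_sane: 20 to 800 characters, per the axis definition.
// Note: String.length counts UTF-16 code units, which is an approximation.
function responseLengthSane(response) {
  const len = response.text.length;
  return len >= 20 && len <= 800;
}

// tool_called_correctly: right tool name and right arguments.
// Arguments are compared via JSON.stringify, which is good enough for
// flat values but is key-order sensitive for nested objects.
function toolCalledCorrectly(response, expected) {
  if (!expected.tool) return response.toolCall == null; // no tool expected
  const call = response.toolCall;
  if (!call || call.name !== expected.tool) return false;
  return Object.entries(expected.args ?? {}).every(
    ([key, value]) => JSON.stringify(call.args?.[key]) === JSON.stringify(value)
  );
}

// Each axis is scored on its own; the case passes only if every axis passes.
function gradeCase(response, expected) {
  const axes = {
    tool_called_correctly: toolCalledCorrectly(response, expected),
    response_length_sane: responseLengthSane(response),
  };
  return { axes, passed: Object.values(axes).every(Boolean) };
}
```

Keeping the per-axis booleans in the result, rather than only the final verdict, makes it possible to see which axis caused a failure.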
Test modes
Mock
Replays golden-set conversations through a mocked LLM. Fast, deterministic, tests orchestration logic and persona compilation. Run on every commit.
Live
Replays through real LLM providers (Anthropic/OpenAI/Groq). Tests actual response quality. Run weekly + before each release.
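The difference between the two modes can be pictured as swapping the provider behind a common interface. The sketch below shows one way a deterministic mock could look. The createMockLlm name, the complete(prompt) method, and the fallback string are all hypothetical and are not taken from the actual runner.

```javascript
// Hypothetical sketch of a deterministic mock provider.
// Names and shapes are illustrative assumptions.
function createMockLlm(cannedResponses, fallback = "MOCK_RESPONSE_NOT_FOUND") {
  const calls = []; // record prompts so tests can inspect orchestration
  return {
    calls,
    // Async to mirror the shape a real provider client would likely have.
    async complete(prompt) {
      calls.push(prompt);
      return cannedResponses.has(prompt) ? cannedResponses.get(prompt) : fallback;
    },
  };
}
```

Because the same prompt always yields the same canned reply, any change in a mock-mode score points to a change in orchestration or persona compilation rather than to model variance.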
Pass threshold: 0.80. Any score below this blocks deployment in CI.
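A CI gate on the 0.80 threshold could be sketched as follows. This assumes the score is simply the fraction of cases that passed, which this page does not specify, and the evaluateRun name and result shape are hypothetical.

```javascript
// Hypothetical sketch: assumes score = passed cases / total cases.
const PASS_THRESHOLD = 0.8; // threshold stated in the methodology

function evaluateRun(results) {
  if (results.length === 0) {
    return { score: 0, ok: false }; // an empty run should not pass the gate
  }
  const passed = results.filter((r) => r.passed).length;
  const score = passed / results.length;
  // "Anything below" the threshold blocks, so exactly 0.80 passes.
  return { score, ok: score >= PASS_THRESHOLD };
}

// In a CI script, a failing gate would typically set a non-zero exit code:
//   const { ok } = evaluateRun(results);
//   if (!ok) process.exitCode = 1;
```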
Reproduce
The golden set, the grader, and the runner are all open and versioned. Run the same eval that produces the leaderboard:
cd cortex/services/partython-brain && node evals/runner.js --live
Golden set: cortex/services/partython-brain/evals/golden-set.jsonl
Grader: cortex/services/partython-brain/evals/grader.js
Questions, suggestions, or want to add a vertical to the golden set? Email evals@partython.com.