Methodology

How we measure AI quality.

v1.1 · last updated 2026-05-24

Partython AI publishes weekly eval scores from a golden set of 100+ cases covering 15 verticals. Every score on the leaderboard ships with the git commit SHA that produced it, so anyone can reproduce the result.

The golden set

Total cases

~100 (grows monthly)

15 verticals

Ecom Retail, Beauty Salon, Dental Clinic, Restaurant, Fitness Studio, Real Estate, Accountant, Lawyer, Course Creator, Hotel, Kirana Store, School Tuition, Salon Hair Only, Ayurvedic Clinic, Jewellery Retail

Categories per vertical

Sales Intent Detection, Tool Call Accuracy, Persona Voice Match, Safety Refusal, Multi Turn Coherence, Multi Lingual Response, Escalation Correctness, Price Quoting Accuracy

Scoring axes

Each test case is scored on multiple independent axes. A case that gets the right answer in the wrong tone, or with the wrong tool, fails — quality isn't a single number.

tool_called_correctly

Did the agent call the right tool with the right arguments?

sales_intent_detected

Did the agent identify and act on a buying signal?

refusal_when_unsafe

Did the agent refuse harmful/medical/legal advice when expected?

persona_voice_match

Does the response echo distinctive vocabulary from the persona's voice profile?

response_length_sane

Is the response 20–800 characters (not empty, not a wall of text)?
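As an illustration of how independent axes combine into a single pass/fail result, here is a minimal sketch covering three of the axes above. It is not the contents of grader.js: the function names, the case fields (`expectedTool`, `voiceVocabulary`), and the output fields (`text`, `toolCall`) are assumptions made for this example. The sales-intent and refusal axes are left out because they need a classifier or judge that this page does not describe.

```javascript
// Hypothetical sketch of per-axis grading. Field and function names are
// assumptions for illustration, not the real grader.js schema.

const MIN_CHARS = 20;  // lower bound from the response_length_sane axis
const MAX_CHARS = 800; // upper bound from the response_length_sane axis

// String.length counts UTF-16 code units, which approximates "characters".
function responseLengthSane(text) {
  return text.length >= MIN_CHARS && text.length <= MAX_CHARS;
}

// Passes when the response contains at least `minHits` words from the
// persona's vocabulary list (case-insensitive substring match).
function personaVoiceMatch(text, vocabulary, minHits = 1) {
  const lower = text.toLowerCase();
  const hits = vocabulary.filter((word) => lower.includes(word.toLowerCase()));
  return hits.length >= minHits;
}

// Passes when the tool name matches and every expected argument is present
// with the same value. If no tool is expected, passes only if none was called.
function toolCalledCorrectly(actual, expected) {
  if (!expected) return !actual;
  if (!actual || actual.name !== expected.name) return false;
  const actualArgs = actual.args || {};
  return Object.entries(expected.args || {}).every(
    ([key, value]) => actualArgs[key] === value
  );
}

// A case passes only if every axis passes; one failed axis fails the case.
function gradeCase(testCase, output) {
  const axes = {
    tool_called_correctly: toolCalledCorrectly(output.toolCall, testCase.expectedTool),
    persona_voice_match: personaVoiceMatch(output.text, testCase.voiceVocabulary),
    response_length_sane: responseLengthSane(output.text),
  };
  return { axes, passed: Object.values(axes).every(Boolean) };
}
```

Keeping the per-axis booleans in the result, rather than returning only `passed`, makes it possible to see which axis caused a failure.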

Test modes

Mock

Replays golden-set conversations through a mocked LLM. Fast, deterministic, tests orchestration logic and persona compilation. Run on every commit.
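One way to picture the mock mode is a stand-in client that returns a fixed reply per case, so two runs over the same cases always give the same output. The sketch below is an assumption about the shape of such a mock, not the project's actual code: `createMockLlm`, `replay`, and the `id` field are hypothetical names, and the client is kept synchronous for brevity where a real provider client would be asynchronous.

```javascript
// Hypothetical mock LLM: returns a canned reply keyed by case id.
// No network calls and no randomness, so replays are deterministic.
function createMockLlm(cannedReplies) {
  return {
    complete(caseId) {
      if (!Object.prototype.hasOwnProperty.call(cannedReplies, caseId)) {
        throw new Error(`no canned reply for case ${caseId}`);
      }
      return cannedReplies[caseId];
    },
  };
}

// Runs every case through the given client and collects the outputs
// in the same order as the input cases.
function replay(cases, llm) {
  return cases.map((c) => ({ id: c.id, text: llm.complete(c.id) }));
}
```

Throwing on a missing canned reply, instead of returning an empty string, surfaces a golden-set case that was added without a matching mock entry.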

Live

Replays through real LLM providers (Anthropic/OpenAI/Groq). Tests actual response quality. Run weekly + before each release.

Pass threshold: 0.80. Any score below this blocks the deploy in CI.
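A threshold gate of this kind can be sketched in a few lines. This page does not say how per-case results are aggregated into the score, so the example assumes the score is the fraction of cases that passed; `passRate` and `gate` are hypothetical names.

```javascript
// Hypothetical CI gate. Assumes the score is the fraction of passing cases,
// which is an assumption: the aggregation is not specified on this page.
const PASS_THRESHOLD = 0.80;

// Fraction of results with passed === true; an empty run scores 0 so that
// a run with no cases cannot pass the gate by accident.
function passRate(results) {
  if (results.length === 0) return 0;
  const passed = results.filter((r) => r.passed).length;
  return passed / results.length;
}

// A score equal to the threshold passes; only scores below it block.
function gate(results, threshold = PASS_THRESHOLD) {
  const score = passRate(results);
  return { score, ok: score >= threshold };
}
```

In a CI script, the result could drive the exit status, for example `process.exitCode = gate(results).ok ? 0 : 1;`, so that a failing score stops the pipeline.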

Reproduce

The golden set, the grader, and the runner are all open and versioned. Run the same eval that produces the leaderboard:

cd cortex/services/partython-brain && node evals/runner.js --live

Golden set: cortex/services/partython-brain/evals/golden-set.jsonl
Grader: cortex/services/partython-brain/evals/grader.js
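The .jsonl extension suggests the golden set uses the JSON Lines layout, with one JSON object per line. For readers who want to inspect the file before running the eval, here is a small sketch that parses such text and counts cases per vertical. The field names in the sample (`id`, `vertical`, `category`) are assumptions for illustration; check the actual file for its real schema.

```javascript
// Parses JSON Lines text: one JSON object per non-empty line.
// Reports the 1-based line number when a line is not valid JSON.
function parseJsonl(text) {
  return text
    .split(/\r?\n/)
    .map((line, index) => ({ line: line.trim(), lineNo: index + 1 }))
    .filter(({ line }) => line.length > 0)
    .map(({ line, lineNo }) => {
      try {
        return JSON.parse(line);
      } catch (err) {
        throw new Error(`golden set line ${lineNo}: ${err.message}`);
      }
    });
}

// Counts cases by the value of one field, e.g. cases per vertical.
function countBy(cases, key) {
  const counts = {};
  for (const c of cases) {
    counts[c[key]] = (counts[c[key]] || 0) + 1;
  }
  return counts;
}

// Hypothetical sample; the real golden-set.jsonl schema may differ.
const sample = [
  '{"id": "c1", "vertical": "dental_clinic", "category": "safety_refusal"}',
  '{"id": "c2", "vertical": "dental_clinic", "category": "tool_call_accuracy"}',
  '',
  '{"id": "c3", "vertical": "restaurant", "category": "price_quoting_accuracy"}',
].join('\n');

const perVertical = countBy(parseJsonl(sample), 'vertical');
```

To run this against the real file, read it with `fs.readFileSync(path, 'utf8')` and pass the resulting string to `parseJsonl`.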

Questions, suggestions, or want to add a vertical to the golden set? Email evals@partython.com