AI QA & Eval Engineer
One Person. Every Failure Mode. Every Benchmark.
AI systems are probabilistic. Traditional QA asks “does this pass or fail?” AI QA asks “why did the model do this — and should we trust it?” That question is harder than it sounds, and most companies don’t have anyone who can answer it rigorously.
This role is the answer.
Evaluation is a new discipline. There’s no established playbook. The companies we work with are shipping AI into production — healthcare, semiconductor, fintech, legal — and they need people who can build the systems that measure whether AI actually works. Not vibes. Not demos. Rigorous evaluation frameworks that catch real failures before users do.
This isn’t a job description. It’s a career architecture. Ten levels: at L1-L5 you find problems; at L6-L10 you design how organizations evaluate AI. Your statistical rigor is the foundation. Your judgment about what “good” looks like is what makes you irreplaceable.
The 10 Levels
- L1: Follow annotation guidelines, learn LLM APIs, run basic eval scripts. Not billing-ready; Worca invests in you.
- L2: Run eval scripts, write test cases, flag issues with evidence. Discount billing; the client knows they’re developing you alongside us.
- L3: Design eval frameworks, build benchmarks, red-team models. Standard rate. The default Worca placement level.
- L4: Diagnose WHY models fail, not just that they failed. Statistical rigor, root-cause analysis, actionable fixes. Premium rate.
- L5: Coach L1-L4, build QA playbooks, manage eval teams. Can implement light fixes: prompt engineering, data curation. The bridge to architecture.
- L6: The taste gate. Design eval methodology across domains and organizations. Eval as a discipline, not just a task.
- L7: Eval strategy at the company level. Regulatory compliance, safety and alignment testing, audit frameworks. Leadership trusts your judgment.
- L8: Multi-company eval standards. Industry benchmarks. The person companies call when they need eval methodology for a new domain.
- L9: Builds and leads eval organizations. Hires, trains, defines culture. The person who makes eval teams exist.
- L10: Defines industry evaluation standards. A portfolio of eval frameworks across multiple companies and domains. By invitation only.
Two Tracks, One Path
L1-L5: QA & Eval. You find problems. The question at every level is: how well do you evaluate AI? Annotation quality, benchmark design, red-teaming depth, and growing diagnostic ability. L1-L3 are trainable execution skills: Python fluency, statistical literacy, and discipline get you there. L4 adds root-cause analysis. L5 proves you can develop other evaluators.
L6-L10: Eval Architecture. You design how organizations evaluate AI. The question shifts to: are you measuring the right things? Methodology design, cross-domain frameworks, regulatory compliance, safety standards. L6 is the methodology gate: the move from doing eval to designing eval systems. Good evaluators are everywhere, but architects who can look at a new domain and design its evaluation framework from scratch? That takes a kind of rigor that can’t be taught the way Python can.
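To make the L1-L3 skill set concrete, here is a minimal sketch of the kind of eval script an early-level evaluator runs and extends: a small benchmark of test cases, a grader, and a failure report that flags issues with evidence. The `model_answer` function and its canned answers are hypothetical stand-ins for whatever LLM API the client actually uses.

```python
# Minimal eval harness sketch: benchmark cases, a grader, and a
# failure report with evidence. `model_answer` is a hypothetical
# stand-in for a real LLM API call.
from dataclasses import dataclass

@dataclass
class Case:
    prompt: str
    expected: str  # gold answer for exact-match grading

def model_answer(prompt: str) -> str:
    # Placeholder: replace with a real model call.
    canned = {"2 + 2 = ?": "4", "Capital of France?": "Paris"}
    return canned.get(prompt, "I don't know")

def run_eval(cases):
    failures = []
    for case in cases:
        got = model_answer(case.prompt)
        if got.strip() != case.expected:
            # Flag the issue *with evidence*: prompt, expected, actual.
            failures.append((case.prompt, case.expected, got))
    accuracy = 1 - len(failures) / len(cases)
    return accuracy, failures

BENCHMARK = [
    Case("2 + 2 = ?", "4"),
    Case("Capital of France?", "Paris"),
    Case("Largest planet?", "Jupiter"),
]

accuracy, failures = run_eval(BENCHMARK)
print(f"accuracy: {accuracy:.0%}")
for prompt, expected, got in failures:
    print(f"FAIL  prompt={prompt!r}  expected={expected!r}  got={got!r}")
```

Exact-match grading is the simplest possible grader; real benchmarks swap in rubric scoring, model-graded judging, or domain-specific checks, but the shape of the harness stays the same.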
| Transition | Minimum Time | Cumulative from L1 |
|---|---|---|
| L1 → L2 | 3 months | 3 months |
| L2 → L3 | 6 months | 9 months |
| L3 → L4 | 12 months | ~2 years |
| L4 → L5 | 18 months | ~3.5 years |
| L5 → L6 | 24 months | ~5.5 years |
| L6 → L7 | 24 months | ~7.5 years |
| L7 → L8 | 36 months | ~10.5 years |
| L8 → L9 | 36 months | ~13.5 years |
| L9 → L10 | By invitation | 15+ years |
No fast-tracking past L7. Methodology, trust, and judgment take time. There are no shortcuts at the top.
The Key Gates
Each level has one checkpoint question. If you can’t answer “yes” with evidence, you’re not ready.
| Level | Gate |
|---|---|
| L1 | Can they follow annotation guidelines accurately and learn LLM APIs in a guided environment? |
| L2 | Can they run eval scripts, write test cases, and flag issues with clear evidence? |
| L3 | Can they design an eval framework and build benchmarks for a domain they haven’t seen before? |
| L4 | Can they diagnose WHY a model fails, not just that it fails — and suggest concrete fixes? |
| L5 | Can they coach L1-L4 and make the whole eval team more rigorous? |
| L6 | Can they design eval methodology for a new domain from scratch? |
| L7 | Can they build company-level eval strategy that satisfies regulatory and safety requirements? |
| L8 | Can they define eval standards that multiple companies adopt? |
| L9 | Can they build and lead an eval organization from zero? |
| L10 | Have they shaped how the industry evaluates AI? |
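The “statistical rigor” behind the L4 gate is often as simple as refusing to call a benchmark delta real without checking it. A minimal sketch, assuming exact-match scoring on a shared test set: a two-proportion z-test on two models’ pass rates, using only the standard library. The counts are invented for illustration.

```python
# Is model B's higher pass rate on a shared benchmark a real
# improvement, or noise? Two-proportion z-test sketch (stdlib only).
# The counts below are invented for illustration.
import math

def two_proportion_z(pass_a, n_a, pass_b, n_b):
    """Return (z, two-sided p-value) for H0: equal pass rates."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF via erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Model A passes 78/100 cases; model B passes 84/100.
z, p = two_proportion_z(78, 100, 84, 100)
print(f"z = {z:.2f}, p = {p:.3f}")
# A 6-point gap on n=100 is not significant at the 0.05 level,
# so "B is better" is not yet a defensible claim.
```

This is the difference between an L2 (“B scored higher”) and an L4 (“B scored higher, but on this sample size the gap is indistinguishable from noise; here is how many cases we need to know”).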
Where Do You Fit?
Click any level above to see its full evaluation criteria, training curriculum, and promotion requirements. Read the descriptions honestly — place yourself where you actually are, not where you want to be. Then work the path.
Why This Path Exists
AI evaluation is one of the fastest-growing disciplines in tech, and one of the least understood. Most companies treat eval as an afterthought — run a few tests, check a dashboard, ship. Then the model hallucinates in production and nobody knows why.
The Worca path is different. Every level you climb:
- Increases your rate — clients pay more for higher-ranked eval talent
- Compounds your skills — each level adds new capabilities on top of everything before
- Builds your portfolio — real eval frameworks deployed at real companies, not academic papers
- Creates passive income — mentee referral cuts reward you for developing others
- Makes you irreplaceable — by L6+, you’re not interchangeable. You’re a specific person clients want because you’ve evaluated their domain before.
AI doesn’t replace you. AI is what you evaluate. What AI can’t do is judge itself — decide what “good” means, design the tests that catch real failures, and build the frameworks that make organizations trust their AI systems. That’s what you do. That’s what makes you valuable.
For companies: See why clients choose Worca AI eval talent.
How to Apply
Send your resume and a brief note on evaluation work you’ve done to careers@worca.io.
You don’t need to be experienced. You need to be rigorous, statistically literate, and willing to find problems others miss. Everyone starts at L1. Every Worca AI Eval Partner once stood where you stand.
Talent partners: see our sourcing and evaluation guide for this role.