Contract or Full-Time
AI Eval Engineer
Remote or Hybrid · US Hours Required
Python · LLM APIs · Claude Code · Statistical Analysis · Prompt Engineering · CI/CD
Why This Role Exists
AI systems are probabilistic, not deterministic. Traditional QA asks “does this pass or fail?” AI evaluation asks “why did the model give this answer — and can we trust it?” As companies move AI from demos to production, someone needs to build the systems that answer that question rigorously and at scale.
This is a new discipline. There’s no established playbook. The startups we work with need engineers who can design evaluation frameworks from scratch — benchmarks, red-teaming pipelines, safety testing, regression suites — and iterate as models and requirements change.
What You’d Work On
- LLM evaluation pipelines — accuracy, hallucination detection, safety, and trustworthiness scoring
- Benchmark design — create and maintain domain-specific evaluation suites that catch real failures
- Red-teaming and adversarial testing — systematically probe models for failure modes and edge cases
- Automated regression testing — catch quality regressions when models are updated or fine-tuned (see the sketch after this list)
- Human-in-the-loop evaluation workflows — design rubrics, manage annotation, measure inter-rater reliability
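
To make the regression-testing item concrete, here is a minimal sketch of the kind of harness this role builds. Everything in it (`EvalCase`, `run_suite`, the keyword-match criterion, the stub model) is illustrative, not an existing Worca codebase; real suites typically score outputs with rubrics or judge models rather than keyword checks.

```python
"""Minimal sketch of an automated regression eval (all names are illustrative)."""
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    must_contain: list[str]  # toy pass criterion; real suites use rubrics or judge models


def run_suite(cases: list[EvalCase], generate: Callable[[str], str]) -> float:
    """Run every case through the model and return the fraction that pass."""
    passed = 0
    for case in cases:
        output = generate(case.prompt)
        if all(term.lower() in output.lower() for term in case.must_contain):
            passed += 1
    return passed / len(cases)


if __name__ == "__main__":
    # Stub generator stands in for a real model call (e.g. an LLM API client).
    def fake_model(prompt: str) -> str:
        return "Paris is the capital of France."

    suite = [EvalCase("What is the capital of France?", ["Paris"])]
    score = run_suite(suite, fake_model)
    # In CI, fail the build if the pass rate drops below a pinned baseline.
    assert score >= 0.95, f"regression: pass rate {score:.2%} below baseline"
```

Wired into CI/CD, a harness like this turns "the new model feels worse" into a measurable, blockable regression.
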
Who We’re Looking For
- Strong Python — testing frameworks, data analysis, scripting. You build evaluation systems, not just run them.
- Probabilistic thinking — you understand that AI outputs aren’t pass/fail. You can reason about confidence, uncertainty, and when a model’s answer is “good enough.”
- LLM/ML evaluation experience — you’ve built or maintained evaluation pipelines for real AI products
- Statistical rigor — you understand metrics, significance, sampling, and inter-rater reliability (illustrated in the sketch after this list)
- Prompt engineering depth — can design evaluation prompts, judge models, and build scoring rubrics
- Comfortable with ambiguity — evaluation criteria are often subjective and evolving. You define the standard, not just enforce it.
- AI-first workflow — you use Claude Code, Cursor, or similar to move fast
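
As one example of the statistical rigor the role calls for, here is a small, self-contained sketch that computes Cohen's kappa, a standard inter-rater reliability measure for annotation workflows. The function name and sample labels are illustrative only.

```python
"""Cohen's kappa for two annotators on categorical labels (illustrative sketch)."""
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Agreement between two raters, corrected for agreement expected by chance."""
    assert len(rater_a) == len(rater_b) and rater_a, "need paired labels"
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a.keys() | freq_b.keys())
    if expected == 1.0:  # both raters used a single identical label everywhere
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    a = ["good", "good", "bad", "good", "bad"]
    b = ["good", "bad", "bad", "good", "bad"]
    print(f"kappa = {cohens_kappa(a, b):.2f}")  # ~0.62 for this toy data
```

Raw percent agreement overstates annotator consistency when one label dominates; chance-corrected measures like kappa are why the rubric and annotation bullets above matter.
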
What Worca Offers
- Interesting work — directly with founding teams at AI startups
- Flexibility — hourly, part-time, or full-time
- USD compensation — competitive, benchmarked to the role
- Continuity — Worca manages employment and matches you to next engagements
Engagement Structure
- Type: Contract (hourly or monthly) or full-time
- Trial: 2-4 week project
- Timezone: APAC-based, US hours overlap required
- Location: Remote — Philippines, Taiwan, Singapore, or broader APAC
How to Apply
Send your resume and a brief note on evaluation work you’ve done to careers@worca.io.
Talent partners: see our sourcing and evaluation guide for this role.