Sourcing Guide
How We Evaluate AI Eval Engineers
For recruiters, talent partners, and clients
What This Role Is (and Isn’t)
This Role IS
- Designing evaluation frameworks and benchmarks for LLMs/ML models
- Building automated testing pipelines for model quality (see the sketch after this list)
- Red-teaming and adversarial testing
- Defining metrics and rubrics for subjective quality
- Designing human evaluation workflows
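To make these deliverables concrete, here is a minimal sketch of the kind of automated eval pipeline the role builds: a scoring function run over a fixed test set, gated on an aggregate pass rate. Everything here (the test cases, the `model_answer` placeholder, the threshold) is illustrative, not taken from any real codebase.

```python
# Minimal sketch of an automated eval pipeline (all names illustrative).

TEST_CASES = [
    {"prompt": "What is the capital of France?", "expected": "paris"},
    {"prompt": "What is 2 + 2?", "expected": "4"},
]

def model_answer(prompt: str) -> str:
    """Placeholder for whatever model call is under evaluation."""
    raise NotImplementedError

def run_eval(threshold: float = 0.9) -> bool:
    """Score each case with simple substring matching; gate on pass rate."""
    passed = sum(
        case["expected"] in model_answer(case["prompt"]).lower()
        for case in TEST_CASES
    )
    pass_rate = passed / len(TEST_CASES)
    print(f"pass rate: {pass_rate:.0%}")
    return pass_rate >= threshold
```

Real pipelines are larger (per-task metrics, versioned datasets, dashboards), but a candidate should be able to whiteboard something of this shape without prompting.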
This Role IS NOT
- ML model training or fine-tuning
- General QA/testing (no ML context)
- Data labeling or annotation (though the role may design annotation workflows)
- Product management or requirements gathering
- Frontend or infrastructure engineering
Where to Find Candidates
Target Companies (APAC)
- AI Safety/Alignment: Companies working on RLHF, constitutional AI, red-teaming
- LLM Platforms: Teams building evaluation tooling for chatbots, copilots, and agents
- ML Quality: Companies with dedicated model quality or evaluation teams
LinkedIn Search Strings
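Example boolean strings to seed a search. These are illustrative starting points, not an exhaustive list; adapt titles and keywords to local markets.

```
("evaluation engineer" OR "evals" OR "model evaluation") AND (LLM OR "large language model") AND Python
("red teaming" OR "adversarial testing" OR "AI safety") AND ("language model" OR LLM)
("model quality" OR "eval pipeline" OR benchmark) AND ("research engineer" OR "ML engineer")
```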
Screening Criteria
| Dimension | 1 — Weak | 3 — Good | 5 — Exceptional |
|---|---|---|---|
| Evaluation Design | Only uses accuracy/F1. No custom metrics. | Designs task-specific benchmarks. Understands metric limitations. | Builds evaluation frameworks used across teams. Novel metrics for subjective quality. |
| Python & Tooling | Scripts only. No testing frameworks. | pytest, CI/CD integration, data pipelines for eval. | Designs evaluation platforms. Automated regression detection. |
| Statistical Rigor | Reports numbers without confidence intervals. | Understands significance, sampling, inter-rater reliability. | Designs experiments. Power analysis. Handles distribution shift. |
| LLM Knowledge | Uses LLMs but can't evaluate them systematically. | Prompt-based evaluation, rubric design, human-AI agreement. | Red-teaming expertise. Safety evaluation. Multi-turn evaluation. |
| Startup Fit | Needs detailed specs. Waits for direction. | Self-directed. Scopes own work. Communicates proactively. | Founder mentality. Owns outcomes end-to-end. |
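To calibrate the "Good" column for Python & Tooling and Statistical Rigor: a candidate at that level can gate CI on an eval metric without flaky failures. Below is a minimal sketch under stated assumptions; `load_eval_scores()` and the baseline value are hypothetical stand-ins, not a real API.

```python
# Hypothetical pytest-style regression gate for an eval metric.
# `load_eval_scores` and BASELINE_ACCURACY are illustrative stand-ins.
import random

BASELINE_ACCURACY = 0.82  # hypothetical accuracy from the last release

def load_eval_scores() -> list[int]:
    """Stand-in: per-example 0/1 scores from the latest eval run."""
    raise NotImplementedError

def bootstrap_ci(scores: list[int], n_resamples: int = 2000,
                 alpha: float = 0.05) -> tuple[float, float]:
    """Percentile-bootstrap confidence interval for the mean score."""
    means = sorted(
        sum(random.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    return (means[int(alpha / 2 * n_resamples)],
            means[int((1 - alpha / 2) * n_resamples)])

def test_accuracy_has_not_regressed():
    """Runs under pytest; fails only on a statistically clear drop."""
    lo, hi = bootstrap_ci(load_eval_scores())
    # Tolerate sampling noise: flag a regression only when the entire
    # confidence interval falls below the baseline.
    assert hi >= BASELINE_ACCURACY, f"regression: CI ({lo:.3f}, {hi:.3f})"
```

Asking a candidate to critique this sketch (for instance, when a percentile bootstrap is inappropriate) doubles as a Statistical Rigor probe.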
Interview Process
Step 1: Resume Screen (5 min)
- Has built evaluation pipelines or benchmarks
- Python as primary language
- Experience with LLM/ML evaluation (not just model training)
Step 2: Technical Screen (30 min)
- “Walk me through an evaluation framework you designed. What metrics did you choose and why?”
- “How would you evaluate an LLM chatbot for hallucination?”
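For the hallucination question, a strong answer usually describes claim-level grounding against source material rather than a single holistic score. A minimal sketch of that pattern, assuming a hypothetical `llm_judge` call (not a real API):

```python
# Illustrative hallucination check via claim-level grounding.
# `llm_judge` stands in for any LLM call returning "supported"/"unsupported".

def llm_judge(claim: str, source: str) -> str:
    """Stand-in: ask a judge model whether `source` supports `claim`."""
    raise NotImplementedError

def hallucination_rate(claims: list[str], source: str) -> float:
    """Fraction of extracted claims the judge cannot ground in the source."""
    unsupported = sum(llm_judge(c, source) == "unsupported" for c in claims)
    return unsupported / len(claims) if claims else 0.0
```

Listen for whether candidates also raise judge reliability (human-AI agreement, rubric design) rather than trusting the judge blindly.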