“Before you can judge a model, you need to understand what you’re looking at.”
You’re here to learn. AI evaluation is a new discipline and most people entering it have never done it before — that’s fine. What matters is that you’re precise, you follow instructions exactly, and you’re building the foundation that everything else rests on.
At L1, you’re not billing-ready. Worca is investing in you. You follow annotation guidelines, learn how LLM APIs work, run scripts other people wrote, and start developing the instinct for “this output looks wrong.” That instinct is worth more than you think. Most people look at AI output and can’t tell good from bad. You’re training your eye.
The work is unglamorous. Labeling data. Running evaluation scripts. Reading rubrics. Comparing model outputs against gold standards. But this is where every great eval engineer starts — by looking at thousands of AI outputs and learning what “good” actually means in context.
What You Do
- Follow annotation guidelines — label data according to detailed rubrics. Accuracy and consistency matter more than speed.
- Learn LLM APIs — understand how to call models, what parameters do, how to structure prompts. Build comfort with the tools.
- Run existing eval scripts — execute evaluation pipelines that L3+ engineers have built. Learn what the scripts measure and why.
- Compare outputs — look at model outputs side-by-side with reference answers. Document discrepancies clearly.
- Flag issues — when something looks wrong, say so. Write it down with specific examples. Don’t just say “it feels off.”
- Learn Python basics — you don’t need to be a software engineer, but you need to read and modify scripts.
- Document everything — your notes become data. Sloppy notes are useless notes.
AI Skills Required
- Basic LLM API usage — send prompts, receive completions, understand temperature, top-p, and token limits
- Annotation tool proficiency — Label Studio, Prodigy, or similar tools for structured data labeling
- Prompt reading comprehension — understand what a prompt is trying to do and whether the model’s response achieves it
- AI-assisted learning — use Claude or ChatGPT to understand evaluation concepts, Python syntax, and statistical terms you encounter
- Basic scripting — run Python scripts from the command line, modify parameters, read output
Self-Evaluation Checklist
- I can follow a detailed annotation rubric and maintain over 90% agreement with gold standard labels
- I understand what LLM API parameters (temperature, top-p, max tokens) do and how they affect output
- I can run an eval script, read the output, and explain what the metrics mean
- I document issues clearly — specific examples, not vague complaints
- I can write basic Python — loops, functions, file I/O, working with JSON
- I’ve labeled 500+ data points and understand inter-annotator agreement
- I ask questions when guidelines are ambiguous instead of guessing
Training Curriculum
Month 1: Foundations
- Annotation Boot Camp — intensive training on annotation guidelines, rubric interpretation, and consistency. Practice on real datasets with feedback from L3+ mentors.
- LLM API Fundamentals — hands-on exercises calling OpenAI, Anthropic, and open-source model APIs. Understand request/response structure, error handling, rate limits.
- Python for Evaluators — not software engineering Python. Evaluator Python: reading CSVs, parsing JSON, running scripts, basic data manipulation with pandas.
- Evaluation Concepts — what are precision, recall, F1? What is inter-annotator agreement? What does a confusion matrix tell you? Build intuition, not just memorization.
Months 2-3: Applied Practice
- Real Annotation Projects — work on actual client annotation tasks under supervision. Your labels get reviewed by L3+ engineers.
- Eval Script Reading — read and understand existing eval pipelines. Trace the logic. Modify parameters. Run experiments.
- Output Comparison Exercises — given two model outputs for the same prompt, which is better? Why? Write it up. Get feedback.
- Failure Mode Catalog — start building a personal catalog of ways models fail. Hallucinations, refusals, format errors, subtle factual mistakes, instruction-following failures.
Ranking Standard
| Metric | Threshold | How It’s Measured |
|---|---|---|
| Annotation accuracy | 90%+ agreement with gold standard | Spot-check audits |
| Label consistency | Cohen’s kappa above 0.8 with senior annotators | Inter-annotator agreement |
| Script execution | Can run eval scripts independently | Mentor observation |
| Documentation quality | Issues flagged with specific examples | Review of notes |
| Python basics | Can modify script parameters and read output | Practical assessment |
Promotion to L2
Requirements
- Minimum 3 months at L1
- Pass L2 qualification assessment:
- Annotation accuracy test — label a held-out dataset. Measured against gold standard and senior annotator consensus.
- API comprehension — demonstrate understanding of LLM API parameters and their effects on output quality.
- Script proficiency — given an eval script, explain what it does, modify a parameter, run it, and interpret the results.
- Issue documentation — present 5 well-documented issues you flagged during annotation or eval runs.
- Mentor confirmation of readiness
- Consistent attendance and engagement in training sessions
What the Panel Looks For
- Precision — are they careful? Do they catch details? Sloppy annotators become sloppy evaluators.
- Curiosity — do they ask why the model did something, or just label it wrong and move on?
- Reliability — do they show up, follow through, and meet deadlines?
- Learning velocity — are they measurably better now than when they started?
Mentorship at This Level
- You receive: L3+ mentor, weekly check-ins. Focus on annotation quality, Python fundamentals, and building evaluation intuition.
- You give: Nothing yet — focus on learning. But share interesting failure modes you find with the team.
- Exposure: Observe L3+ eval framework design sessions. You’re not contributing yet, but start understanding how evaluation systems get built.
What Unlocks at L2
- Discounted billing rate — you start generating revenue
- Test case writing — you design tests, not just run them
- More independence — less supervision, more ownership of annotation tasks
- First steps toward eval framework understanding