“Before you can judge a model, you need to understand what you’re looking at.”
You’re here to learn. AI evaluation is a new discipline and most people entering it have never done it before — that’s fine. What matters is that you’re precise, you follow instructions exactly, and you’re building the foundation that everything else rests on.
At L1, you’re not billing-ready. Worca is investing in you. You follow annotation guidelines, learn how LLM APIs work, run scripts other people wrote, and start developing the instinct for “this output looks wrong.” That instinct is worth more than you think. Most people look at AI output and can’t tell good from bad. You’re training your eye.
The work is unglamorous. Labeling data. Running evaluation scripts. Reading rubrics. Comparing model outputs against gold standards. But this is where every great eval engineer starts — by looking at thousands of AI outputs and learning what “good” actually means in context.
What You Do
- Follow annotation guidelines — label data according to detailed rubrics. Accuracy and consistency matter more than speed.
- Learn LLM APIs — understand how to call models, what parameters do, how to structure prompts. Build comfort with the tools.
- Run existing eval scripts — execute evaluation pipelines that L3+ engineers have built. Learn what the scripts measure and why.
- Compare outputs — look at model outputs side-by-side with reference answers. Document discrepancies clearly.
- Flag issues — when something looks wrong, say so. Write it down with specific examples. Don’t just say “it feels off.”
- Learn Python basics — you don’t need to be a software engineer, but you need to read and modify scripts.
- Document everything — your notes become data. Sloppy notes are useless notes.
AI Skills Required
- Basic LLM API usage — send prompts, receive completions, understand temperature, top-p, and token limits
- Annotation tool proficiency — Label Studio, Prodigy, or similar tools for structured data labeling
- Prompt reading comprehension — understand what a prompt is trying to do and whether the model’s response achieves it
- AI-assisted learning — use Claude or ChatGPT to understand evaluation concepts, Python syntax, and statistical terms you encounter
- Basic scripting — run Python scripts from the command line, modify parameters, read output
Self-Evaluation Checklist
- I can follow a detailed annotation rubric and maintain over 90% agreement with gold standard labels
- I understand what LLM API parameters (temperature, top-p, max tokens) do and how they affect output
- I can run an eval script, read the output, and explain what the metrics mean
- I document issues clearly — specific examples, not vague complaints
- I can write basic Python — loops, functions, file I/O, working with JSON
- I’ve labeled 500+ data points and understand inter-annotator agreement
- I ask questions when guidelines are ambiguous instead of guessing
Training Curriculum
Month 1: Foundations
- Annotation Boot Camp — intensive training on annotation guidelines, rubric interpretation, and consistency. Practice on real datasets with feedback from L3+ mentors.
- LLM API Fundamentals — hands-on exercises calling OpenAI, Anthropic, and open-source model APIs. Understand request/response structure, error handling, rate limits.
- Python for Evaluators — not software engineering Python. Evaluator Python: reading CSVs, parsing JSON, running scripts, basic data manipulation with pandas.
- Evaluation Concepts — what are precision, recall, F1? What is inter-annotator agreement? What does a confusion matrix tell you? Build intuition, not just memorization.
Months 2-3: Applied Practice
- Real Annotation Projects — work on actual client annotation tasks under supervision. Your labels get reviewed by L3+ engineers.
- Eval Script Reading — read and understand existing eval pipelines. Trace the logic. Modify parameters. Run experiments.
- Output Comparison Exercises — given two model outputs for the same prompt, which is better? Why? Write it up. Get feedback.
- Failure Mode Catalog — start building a personal catalog of ways models fail. Hallucinations, refusals, format errors, subtle factual mistakes, instruction-following failures.
Ranking Standard
| Metric | Threshold | How It’s Measured |
|---|---|---|
| Annotation accuracy | 90%+ agreement with gold standard | Spot-check audits |
| Label consistency | Cohen’s kappa above 0.8 with senior annotators | Inter-annotator agreement |
| Script execution | Can run eval scripts independently | Mentor observation |
| Documentation quality | Issues flagged with specific examples | Review of notes |
| Python basics | Can modify script parameters and read output | Practical assessment |
Promotion to L2
Requirements
- Minimum 3 months at L1
- Pass L2 qualification assessment:
- Annotation accuracy test — label a held-out dataset. Measured against gold standard and senior annotator consensus.
- API comprehension — demonstrate understanding of LLM API parameters and their effects on output quality.
- Script proficiency — given an eval script, explain what it does, modify a parameter, run it, and interpret the results.
- Issue documentation — present 5 well-documented issues you flagged during annotation or eval runs.
- Mentor confirmation of readiness
- Consistent attendance and engagement in training sessions
What the Panel Looks For
- Precision — are they careful? Do they catch details? Sloppy annotators become sloppy evaluators.
- Curiosity — do they ask why the model did something, or just label it wrong and move on?
- Reliability — do they show up, follow through, and meet deadlines?
- Learning velocity — are they measurably better now than when they started?
Mentorship at This Level
- You receive: L3+ mentor, weekly check-ins. Focus on annotation quality, Python fundamentals, and building evaluation intuition.
- You give: Nothing yet — focus on learning. But share interesting failure modes you find with the team.
- Exposure: Observe L3+ eval framework design sessions. You’re not contributing yet, but start understanding how evaluation systems get built.
What Unlocks at L2
- Discounted billing rate — you start generating revenue
- Test case writing — you design tests, not just run them
- More independence — less supervision, more ownership of annotation tasks
- First steps toward eval framework understanding