This page defines the core terms used in the AI engineering workflow. Understanding these concepts is the first step toward building robust and reliable generative AI capabilities.

AI engineering lifecycle

The concepts in AI engineering are best understood within the context of the development lifecycle. While AI capabilities can become highly sophisticated, they typically start simple and evolve through a disciplined, iterative process:
1. Prototype a capability

Development starts by defining a task and prototyping a capability with a prompt to solve it.

2. Evaluate with ground truth

The prototype is then tested against a collection of reference examples (so-called “ground truth”) to measure its quality and effectiveness using scorers. This process is known as an evaluation, or “eval”.

3. Observe in production

Once a capability meets quality benchmarks, it’s deployed. In production, scorers can be applied to live traffic (online evaluation) to monitor performance and cost in real time.

4. Iterate with new insights

Insights from production monitoring reveal edge cases and opportunities for improvement. These new examples are used to refine the capability, expand the ground truth collection, and begin the cycle anew.

AI engineering terms

Capability

A generative AI capability is a system that uses large language models to perform a specific task by transforming inputs into desired outputs. Capabilities exist on a spectrum of complexity, ranging from simple to sophisticated architectures:
  • Single-turn model interactions: A single prompt and response, such as classifying a support ticket’s intent or summarizing a document.
  • Workflows: Multi-step processes where each step’s output feeds into the next, such as research → analysis → report generation.
  • Single-agent: An agent that can reason and make decisions to accomplish a goal, such as a customer support agent that can search documentation, check order status, and draft responses.
  • Multi-agent: Multiple specialized agents collaborating to solve complex problems, such as software engineering through architectural planning, coding, testing, and review.
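To make the simplest end of this spectrum concrete, here is a minimal sketch of a single-turn capability. The `call_model` parameter is a hypothetical stand-in for whatever LLM client your stack uses, not part of any specific SDK.

```python
from typing import Callable

def classify_ticket_intent(ticket_text: str, call_model: Callable[[str], str]) -> str:
    """Single-turn capability: one prompt in, one intent label out.

    `call_model` is a hypothetical stand-in for an LLM client call.
    """
    prompt = (
        "Classify the intent of this support ticket as one of: "
        "billing, bug_report, feature_request, other.\n\n"
        f"Ticket: {ticket_text}\n"
        "Answer with the label only."
    )
    return call_model(prompt).strip().lower()
```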

Collection

A collection is a curated set of reference records used for development, testing, and evaluation of a capability. Collections serve as the test cases for prompt engineering.

Collection record

Collection records are the individual input-output pairs within a collection. Each record consists of an input and its corresponding expected output (ground truth).

Ground truth

Ground truth is the validated, expert-approved correct output for a given input. It represents the gold standard that the AI capability should aspire to match.
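As a concrete sketch, a collection can be modeled as a list of records, each pairing an input with its ground-truth output. The field names below are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class CollectionRecord:
    """One test case: an input plus its expert-approved expected output."""
    input: str
    ground_truth: str

# A tiny collection for a ticket-intent capability (illustrative data).
intent_collection = [
    CollectionRecord(
        input="I was charged twice for my subscription this month.",
        ground_truth="billing",
    ),
    CollectionRecord(
        input="The export button crashes the app on Android.",
        ground_truth="bug_report",
    ),
]
```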

Scorer

A scorer is a function that evaluates a capability’s output. It programmatically assesses quality by comparing the generated output against ground truth or other criteria, returning a score.
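A scorer can be as simple as an exact-match check that returns 1.0 or 0.0; more sophisticated scorers use heuristics or an LLM judge. A minimal sketch:

```python
def exact_match_scorer(output: str, ground_truth: str) -> float:
    """Return 1.0 if the capability's output matches the ground truth exactly."""
    return 1.0 if output.strip().lower() == ground_truth.strip().lower() else 0.0
```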

Evaluation or “eval”

An evaluation, or eval, is the process of testing a capability against a collection of ground truth data using one or more scorers. An eval runs the capability on every record in the collection and reports metrics like accuracy, pass-rate, and cost. Evals are typically run before deployment to benchmark performance.
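Conceptually, an eval is a loop over the collection. The sketch below reuses the `CollectionRecord` and `exact_match_scorer` examples above; `run_capability` is a hypothetical stand-in for the capability under test.

```python
from typing import Callable

def run_eval(
    collection: list[CollectionRecord],
    run_capability: Callable[[str], str],
    scorer: Callable[[str, str], float],
) -> dict:
    """Run the capability on every record and report an aggregate pass rate."""
    scores = [
        scorer(run_capability(record.input), record.ground_truth)
        for record in collection
    ]
    return {
        "num_records": len(scores),
        "pass_rate": sum(scores) / len(scores) if scores else 0.0,
    }
```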

Flag

A flag is a configuration parameter that controls how your AI capability behaves. Flags let you parameterize aspects like model choice, tool availability, prompting strategies, or retrieval approaches. By defining flags, you can run experiments to compare different configurations and systematically determine which approach performs best.
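Flags are often modeled as a small, typed configuration object so that experiments can vary them systematically. The fields below are examples, not required names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CapabilityFlags:
    """Configuration knobs for a capability; the defaults define the baseline."""
    model: str = "small-model-v1"     # which model to call (placeholder name)
    temperature: float = 0.0          # sampling temperature
    use_retrieval: bool = False       # whether to add retrieved context
    prompt_variant: str = "concise"   # which prompt template to use
```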

Experiment

An experiment is an evaluation run with a specific set of flag values. By running multiple experiments with different flag configurations, you can compare performance across different models, prompts, or strategies to find the optimal setup for your capability.
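Continuing the hypothetical sketches above, an experiment is simply an eval run tagged with one concrete flag configuration. Here `classify_with` is a hypothetical wrapper that runs the capability with the given flag values.

```python
experiments = {
    "baseline": CapabilityFlags(model="small-model-v1"),
    "candidate": CapabilityFlags(model="large-model-v2"),
}

results = {}
for name, flags in experiments.items():
    # `classify_with(text, flags)` is a hypothetical wrapper that runs the
    # capability with the given flag values (model, prompt variant, etc.).
    results[name] = run_eval(
        intent_collection,
        lambda text, flags=flags: classify_with(text, flags),
        exact_match_scorer,
    )

# Compare pass rates across flag configurations to pick a winner.
print({name: r["pass_rate"] for name, r in results.items()})
```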

Online evaluation

An online evaluation is the process of applying a scorer to a capability’s live production traffic. This provides real-time feedback on performance degradation, cost, and quality drift, enabling continuous monitoring and improvement.
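Because ground truth is rarely available for live traffic, online evaluations typically sample requests and apply reference-free checks. A minimal sketch, assuming the intent-classification example above:

```python
import random

def maybe_score_live_response(response: str, sample_rate: float = 0.1) -> float | None:
    """Score a sampled fraction of production traffic; return None if skipped."""
    if random.random() > sample_rate:
        return None
    # No ground truth in production, so use a reference-free check:
    # did the response stay within the allowed label set?
    allowed = {"billing", "bug_report", "feature_request", "other"}
    return 1.0 if response.strip().lower() in allowed else 0.0
```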

Annotation

Annotations are expert-provided observations, labels, or corrections added to production traces or evaluation results. Domain experts review AI capability runs and document what went wrong, what should have happened differently, or categorize failure modes. These annotations help identify patterns in capability failures, validate scorer accuracy, and create new test cases for collections.
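Annotations are typically structured records attached to a trace identifier; the fields below are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """Expert review attached to a single production trace."""
    trace_id: str
    reviewer: str
    failure_mode: str                    # e.g. "hallucinated_order_status"
    corrected_output: str | None = None  # what the capability should have produced
    notes: str = ""
```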

User feedback

User feedback is direct signal from end users about AI capability performance, typically collected through ratings (thumbs up/down, stars) or text comments. Feedback events are associated with traces to provide context about both system behavior and user perception. Aggregated feedback reveals quality trends, helps prioritize improvements, and surfaces issues that might not appear in evaluations.
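User feedback can be captured as lightweight events keyed to the same trace identifiers, so ratings and comments can be joined back to system behavior. Again, this schema is only a sketch.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class FeedbackEvent:
    """End-user signal linked to the trace that produced the response."""
    trace_id: str
    rating: Literal["thumbs_up", "thumbs_down"]
    comment: str = ""
```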

What’s next?

Now that you understand the core concepts, get started with the Quickstart or dive into Evaluate to learn about systematic testing.