The Iterate stage closes the loop in AI engineering. By analyzing production performance, validating changes through evaluation, and deploying improvements with confidence, you create a continuous cycle of data-driven enhancement.

The improvement loop

Successful AI engineering follows a systematic pattern:
  1. Analyze production - Identify what needs improvement
  2. Create test cases - Turn failures into ground truth examples
  3. Experiment with changes - Test variations using flags
  4. Validate improvements - Run evaluations to confirm progress
  5. Deploy with confidence - Ship changes backed by data
  6. Repeat - New production data feeds the next iteration

Identify what to improve

Start by understanding how your capability performs in production. The Axiom Console provides multiple signals to help you prioritize:

Production traces

Review traces in the Observe section to find:
  • Real-world inputs that caused failures or low-quality outputs
  • High-cost or high-latency interactions that need optimization
  • Unexpected tool calls or reasoning paths
  • Edge cases your evaluations didn’t cover
Filter to AI spans and examine the full interaction path, including model choices, token usage, and intermediate steps.

User feedback
Coming soon

User feedback capture is coming soon. Contact Axiom to join the design partner program.
User feedback provides direct signal about which interactions matter most to your customers. Axiom’s AI SDK will include lightweight functions to capture both explicit and implicit feedback as timestamped event data:
  • Explicit feedback includes direct user signals like thumbs up, thumbs down, and comments on AI-generated outputs.
  • Implicit feedback captures behavioral signals like copying generated text, regenerating responses, or abandoning interactions.
Because user feedback is stored as timestamped events linked to specific AI runs, you can easily correlate feedback with traces to understand exactly what went wrong and prioritize high-value failures over edge cases that rarely occur.
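Until those helpers ship, feedback can already be captured as ordinary timestamped events with the existing @axiomhq/js client. The sketch below is a minimal illustration; the ai-feedback dataset name, the field names, and the traceId linkage are assumptions rather than a published schema.
// Hypothetical sketch: record explicit user feedback as a timestamped event
// linked to the trace of the AI run it refers to. Dataset and field names
// are illustrative assumptions, not a published Axiom schema.
import { Axiom } from '@axiomhq/js';

const axiom = new Axiom({ token: process.env.AXIOM_TOKEN! });

export async function recordFeedback(
  traceId: string,
  rating: 'up' | 'down',
  comment?: string,
) {
  axiom.ingest('ai-feedback', [
    {
      _time: new Date().toISOString(),
      traceId, // lets you join the feedback back to the production trace
      kind: 'explicit', // vs. 'implicit' for behavioral signals
      rating,
      comment,
    },
  ]);
  await axiom.flush(); // send any batched events
}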

Domain expert annotations
Coming soon

Annotation workflows are coming soon. Contact Axiom to join the design partner program.
Axiom will provide a seamless workflow for domain experts to review production traces and identify patterns in AI capability failures. The Console will surface traces that warrant attention, such as those with negative user feedback or anomalous behavior, and provide an interface for reviewing conversations and annotating them. Annotations can be grouped into failure modes to guide prioritization. For example:
  • Critical failures - Complete breakdowns like API outages, unhandled exceptions, or timeout errors
  • Quality degradation - Declining accuracy scores, increased hallucinations, or off-topic responses
  • Coverage gaps - Out-of-distribution inputs the system wasn’t designed to handle, like unexpected languages or domains
  • User dissatisfaction - Negative feedback on outputs that technically succeeded but didn’t meet user needs
This structured analysis helps teams coordinate improvement efforts, prioritize which failure modes to address first, and track patterns over time.
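To illustrate how such categories might be represented in code, here is a hypothetical annotation record; the type and field names are assumptions for illustration, not the forthcoming Axiom schema.
// Hypothetical shape for a domain-expert annotation. Names are
// illustrative assumptions, not the forthcoming Axiom annotation schema.
type FailureMode =
  | 'critical_failure' // API outages, unhandled exceptions, timeouts
  | 'quality_degradation' // declining accuracy, hallucinations, off-topic output
  | 'coverage_gap' // out-of-distribution inputs (unexpected language or domain)
  | 'user_dissatisfaction'; // technically correct but unhelpful output

interface Annotation {
  traceId: string; // production trace the expert reviewed
  failureMode: FailureMode;
  note: string; // free-form description of what went wrong
  reviewedBy: string;
  reviewedAt: string; // ISO 8601 timestamp
}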

Create test cases from production

Once you’ve identified high-priority failures, turn them into test cases for your evaluation collections. Organizations typically maintain multiple collections for different scenarios, failure modes, or capability variants:
const newTestCases = [
  {
    input: { 
      // Real production input that failed
      subject: 'Refund request for order #12345',
      content: 'I need a refund because the product arrived damaged.' 
    },
    expected: { 
      category: 'refund_request',
      priority: 'high' 
    },
  },
];
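How new cases join an evaluation depends on how your collections are organized. As a minimal sketch, assuming the collection is a plain array in your eval code (the existingTestCases name and module path here are illustrative), the production-derived cases can simply be appended:
// Minimal sketch: append the production-derived cases to the collection
// your evaluation already runs against. The module path and the
// `existingTestCases` export are illustrative assumptions.
import { existingTestCases } from './ticket-classification-cases';

export const allTestCases = [
  ...existingTestCases,
  ...newTestCases, // the failures captured from production above
];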

Experiment with changes

Use flags to test different approaches without changing your code:
# Test with a more capable model
axiom eval ticket-classification --flag.model=gpt-4o

# Try a different temperature
axiom eval ticket-classification --flag.temperature=0.3

# Experiment with prompt variations
axiom eval ticket-classification --flag.promptStrategy=detailed
Run multiple experiments to understand the tradeoffs between accuracy, cost, and latency.
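To explore several options in one pass, the same CLI commands can be scripted. The sketch below shells out to the axiom eval command shown above for each candidate model; the model list is an illustrative assumption.
// Sketch: run the same evaluation across several candidate models and let
// the flag metadata distinguish the runs in the Console. Uses only the CLI
// invocation shown above; the model list is an illustrative assumption.
import { execSync } from 'node:child_process';

const models = ['gpt-4o', 'gpt-4o-mini'];

for (const model of models) {
  execSync(`axiom eval ticket-classification --flag.model=${model}`, {
    stdio: 'inherit', // stream eval output to the terminal
  });
}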

Validate improvements

Before deploying any change, validate it against your full test collection using baseline comparison:
# Run baseline evaluation
axiom eval ticket-classification
# Note the run ID: run_abc123xyz

# Make your changes (update prompt, adjust config, etc.)

# Run again with baseline comparison
axiom eval ticket-classification --baseline run_abc123xyz
The Console shows you exactly how your changes impact:
  • Accuracy: Did scores improve or regress?
  • Cost: Is it more or less expensive?
  • Latency: Is it faster or slower?
Only deploy changes that show clear improvements without unacceptable tradeoffs.
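Teams often codify what counts as a clear improvement without unacceptable tradeoffs so the decision isn’t ad hoc. The sketch below is one hypothetical gate; the ComparisonSummary shape and the thresholds are assumptions for illustration, not values emitted by the CLI.
// Hypothetical deployment gate over baseline-comparison results.
// The `ComparisonSummary` shape and thresholds are illustrative assumptions.
interface ComparisonSummary {
  accuracyDelta: number; // e.g. +0.04 means 4 points better than baseline
  costDelta: number; // relative change in cost per run (0.10 = +10%)
  latencyDelta: number; // relative change in p95 latency
}

function shouldDeploy(summary: ComparisonSummary): boolean {
  const improvesAccuracy = summary.accuracyDelta > 0;
  const costAcceptable = summary.costDelta <= 0.1; // at most 10% more expensive
  const latencyAcceptable = summary.latencyDelta <= 0.15; // at most 15% slower
  return improvesAccuracy && costAcceptable && latencyAcceptable;
}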

Deploy with confidence

Once your evaluations confirm an improvement, deploy the change to production. Because you’ve validated against ground truth data, you can ship with confidence that the new version handles both existing cases and the new failures you discovered. After deployment, return to the Observe stage to monitor performance and identify the next opportunity for improvement.

Best practices

  • Build your collections over time. Your evaluation collections should grow as you discover new failure modes. Each production issue that slips through is an opportunity to strengthen your test coverage.
  • Track improvements systematically. Use baseline comparisons for every change. This creates a clear history of how your capability has improved and prevents regressions.
  • Prioritize high-impact changes. Focus on failures that affect many users or high-value interactions. Not every edge case deserves immediate attention.
  • Experiment before committing. Flags let you test multiple approaches quickly. Run several experiments to understand the solution space before making code changes.
  • Close the loop. The improvement cycle never ends. Each deployment generates new production data that reveals the next set of improvements to make.

What’s next?

To learn more about the evaluation framework that powers this improvement loop, see Evaluate. To understand how to capture rich telemetry from production, see Observe.