The Iterate stage closes the loop in AI engineering. By analyzing production performance, validating changes through evaluation, and deploying improvements with confidence, you create a continuous cycle of data-driven enhancement.

The improvement loop

Successful AI engineering follows a systematic pattern:
  1. Analyze production - Identify what needs improvement
  2. Create test cases - Turn failures into ground truth examples
  3. Experiment with changes - Test variations using flags
  4. Validate improvements - Run evaluations to confirm progress
  5. Deploy with confidence - Ship changes backed by data
  6. Repeat - New production data feeds the next iteration

Identify what to improve

Start by understanding how your capability performs in production. The Axiom Console provides multiple signals to help you prioritize:

Production traces

Review traces in the Observe section to find:
  • Real-world inputs that caused failures or low-quality outputs
  • High-cost or high-latency interactions that need optimization
  • Unexpected tool calls or reasoning paths
  • Edge cases your evaluations didn’t cover
Filter to AI spans and examine the full interaction path, including model choices, token usage, and intermediate steps.

User feedback
Coming soon

User feedback capture is coming soon. Contact Axiom to join the design partner program.
User feedback provides direct signal about which interactions matter most to your customers. Axiom’s AI SDK will include lightweight functions to capture both explicit and implicit feedback as timestamped event data:
  • Explicit feedback includes direct user signals like thumbs up, thumbs down, and comments on AI-generated outputs.
  • Implicit feedback captures behavioral signals like copying generated text, regenerating responses, or abandoning interactions.
Because user feedback is stored as timestamped events linked to specific AI runs, you can easily correlate feedback with traces to understand exactly what went wrong and prioritize high-value failures over edge cases that rarely occur.
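Until those helpers ship, feedback can already be captured as ordinary timestamped events with the existing @axiomhq/js client. The sketch below is a minimal illustration; the ai-feedback dataset name, the field names, and the traceId linkage are assumptions rather than a published schema.
// Hypothetical sketch: record explicit user feedback as a timestamped event
// linked to the trace of the AI run it refers to. Dataset and field names
// are illustrative assumptions, not a published Axiom schema.
import { Axiom } from '@axiomhq/js';

const axiom = new Axiom({ token: process.env.AXIOM_TOKEN! });

export async function recordFeedback(
  traceId: string,
  rating: 'up' | 'down',
  comment?: string,
) {
  axiom.ingest('ai-feedback', [
    {
      _time: new Date().toISOString(),
      traceId, // lets you join the feedback back to the production trace
      kind: 'explicit', // vs. 'implicit' for behavioral signals
      rating,
      comment,
    },
  ]);
  await axiom.flush(); // send any batched events
}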

Domain expert annotations
Coming soon

Annotation workflows are coming soon. Contact Axiom to join the design partner program.
Axiom will provide a seamless workflow for domain experts to review production traces and identify patterns in AI capability failures. The Console will surface traces that warrant attention, such as those with negative user feedback or anomalous behavior, and provide an interface for reviewing conversations and annotating them. Annotations can be grouped into failure modes to guide prioritization. For example:
  • Critical failures - Complete breakdowns like API outages, unhandled exceptions, or timeout errors
  • Quality degradation - Declining accuracy scores, increased hallucinations, or off-topic responses
  • Coverage gaps - Out-of-distribution inputs the system wasn’t designed to handle, like unexpected languages or domains
  • User dissatisfaction - Negative feedback on outputs that technically succeeded but didn’t meet user needs
This structured analysis helps teams coordinate improvement efforts, prioritize which failure modes to address first, and track patterns over time.
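To illustrate how such categories might be represented in code, here is a hypothetical annotation record; the type and field names are assumptions for illustration, not the forthcoming Axiom schema.
// Hypothetical shape for a domain-expert annotation. Names are
// illustrative assumptions, not the forthcoming Axiom annotation schema.
type FailureMode =
  | 'critical_failure' // API outages, unhandled exceptions, timeouts
  | 'quality_degradation' // declining accuracy, hallucinations, off-topic output
  | 'coverage_gap' // out-of-distribution inputs (unexpected language or domain)
  | 'user_dissatisfaction'; // technically correct but unhelpful output

interface Annotation {
  traceId: string; // production trace the expert reviewed
  failureMode: FailureMode;
  note: string; // free-form description of what went wrong
  reviewedBy: string;
  reviewedAt: string; // ISO 8601 timestamp
}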

Create test cases from production

Once you’ve identified high-priority failures, turn them into test cases for your evaluation collections. Organizations typically maintain multiple collections for different scenarios, failure modes, or capability variants:
const newTestCases = [
  {
    input: { 
      // Real production input that failed
      subject: 'Refund request for order #12345',
      content: 'I need a refund because the product arrived damaged.' 
    },
    expected: { 
      category: 'refund_request',
      priority: 'high' 
    },
  },
];
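How new cases join an evaluation depends on how your collections are organized. As a minimal sketch, assuming the collection is a plain array in your eval code (the existingTestCases name and module path here are illustrative), the production-derived cases can simply be appended:
// Minimal sketch: append the production-derived cases to the collection
// your evaluation already runs against. The module path and the
// `existingTestCases` export are illustrative assumptions.
import { existingTestCases } from './ticket-classification-cases';

export const allTestCases = [
  ...existingTestCases,
  ...newTestCases, // the failures captured from production above
];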

Experiment with changes

Use flags to test different approaches without changing your code:
# Test with a more capable model
axiom eval ticket-classification --flag.model=gpt-4o

# Try a different temperature
axiom eval ticket-classification --flag.temperature=0.3

# Experiment with prompt variations
axiom eval ticket-classification --flag.promptStrategy=detailed
Run multiple experiments to understand the tradeoffs between accuracy, cost, and latency.
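To explore several options in one pass, the same CLI commands can be scripted. The sketch below shells out to the axiom eval command shown above for each candidate model; the model list is an illustrative assumption.
// Sketch: run the same evaluation across several candidate models and let
// the flag metadata distinguish the runs in the Console. Uses only the CLI
// invocation shown above; the model list is an illustrative assumption.
import { execSync } from 'node:child_process';

const models = ['gpt-4o', 'gpt-4o-mini'];

for (const model of models) {
  execSync(`axiom eval ticket-classification --flag.model=${model}`, {
    stdio: 'inherit', // stream eval output to the terminal
  });
}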

Validate improvements

Before deploying any change, validate it against your full test collection using baseline comparison:
# Run baseline evaluation
axiom eval ticket-classification
# Note the run ID: run_abc123xyz

# Make your changes (update prompt, adjust config, etc.)

# Run again with baseline comparison
axiom eval ticket-classification --baseline run_abc123xyz
The Console shows you exactly how your changes impact:
  • Accuracy: Did scores improve or regress?
  • Cost: Is it more or less expensive?
  • Latency: Is it faster or slower?
Only deploy changes that show clear improvements without unacceptable tradeoffs.
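Teams often codify what counts as a clear improvement without unacceptable tradeoffs so the decision isn’t ad hoc. The sketch below is one hypothetical gate; the ComparisonSummary shape and the thresholds are assumptions for illustration, not values emitted by the CLI.
// Hypothetical deployment gate over baseline-comparison results.
// The `ComparisonSummary` shape and thresholds are illustrative assumptions.
interface ComparisonSummary {
  accuracyDelta: number; // e.g. +0.04 means 4 points better than baseline
  costDelta: number; // relative change in cost per run (0.10 = +10%)
  latencyDelta: number; // relative change in p95 latency
}

function shouldDeploy(summary: ComparisonSummary): boolean {
  const improvesAccuracy = summary.accuracyDelta > 0;
  const costAcceptable = summary.costDelta <= 0.1; // at most 10% more expensive
  const latencyAcceptable = summary.latencyDelta <= 0.15; // at most 15% slower
  return improvesAccuracy && costAcceptable && latencyAcceptable;
}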

Deploy with confidence

Once your evaluations confirm an improvement, deploy the change to production. Because you’ve validated against ground truth data, you can ship with confidence that the new version handles both existing cases and the new failures you discovered. After deployment, return to the Observe stage to monitor performance and identify the next opportunity for improvement.

Best practices

  • Build your collections over time. Your evaluation collections should grow as you discover new failure modes. Each production issue that slips through is an opportunity to strengthen your test coverage.
  • Track improvements systematically. Use baseline comparisons for every change. This creates a clear history of how your capability has improved and prevents regressions.
  • Prioritize high-impact changes. Focus on failures that affect many users or high-value interactions. Not every edge case deserves immediate attention.
  • Experiment before committing. Flags let you test multiple approaches quickly. Run several experiments to understand the solution space before making code changes.
  • Close the loop. The improvement cycle never ends. Each deployment generates new production data that reveals the next set of improvements to make.

What’s next?

To learn more about the evaluation framework that powers this improvement loop, see Evaluate. To understand how to capture rich telemetry from production, see Observe.