Monitoring and Troubleshooting with Metrics
This feature is available in the Enterprise Edition only.
This page provides practical guidance for monitoring Agent Mesh using OpenTelemetry metrics, including dashboard examples, alert rules, best practices, and troubleshooting.
For an introduction to application metrics and instructions for enabling them, see Application Metrics with OpenTelemetry. For details about available metrics and configuration options, see Configuring OpenTelemetry Metrics. For integration setup, see Integrating OpenTelemetry Metrics.
Common Monitoring Scenarios
This section provides practical guidance for common monitoring scenarios using the available metrics.
Agent Health and Utilization
Use the sam.operation.duration metric to monitor agent health and performance:
Key Insights:
- Monitor request rates per agent to identify busy or underutilized agents
- Track latency percentiles (P50, P95, P99) to understand agent response time characteristics
- Calculate success rates by comparing successful requests (error_type="none") to total requests (see the example query after this list)
- Identify agents with degraded performance by comparing current latency to historical baselines
Visualizing in Grafana:
- Create a new panel and select your Prometheus datasource
- Enter the query for P95 latency by agent:
histogram_quantile(0.95, sum(rate(sam_operation_duration_bucket{type="agent"}[5m])) by (component_name, le))
- Set the visualization type to "Time series"
- Configure display:
- Legend: Show component_name
- Y-axis: Seconds
- Add threshold lines at your SLO values (for example, 5s warning, 10s critical)
- Add a second query for request rate:
sum(rate(sam_operation_duration_count{type="agent"}[5m])) by (component_name)
Visualizing in DataDog:
- Navigate to Dashboards and select New Dashboard
- Add a Timeseries widget
- Use query:
p95:sam.operation.duration{type:agent} by {component_name}
- Set display type to "Line"
- Add monitor thresholds to visualize SLO boundaries
- Add a second widget showing request volume:
sum:sam.operation.duration.count{type:agent} by {component_name}.as_rate()
Alert Recommendations: Alert when agent success rate drops below 95% or when latency percentiles exceed established SLO thresholds.
LLM Cost and Token Tracking
Use the sam.gen_ai.tokens.used counter metric to track token consumption and estimate costs:
Key Insights:
- Sum token counts by model and token type (input versus output) over time periods
- Calculate estimated costs by multiplying token counts by model pricing (pricing varies by provider and model)
- Identify high-cost agents or models by grouping token usage by component_name and gen_ai_request_model (see the example query after this list)
- Track token usage trends over time to forecast budgets and identify usage spikes
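For example, a Prometheus query for token consumption per agent and model over the last 24 hours (a sketch using the sam_gen_ai_tokens_used counter that also appears in the dashboard examples later on this page):
sum(increase(sam_gen_ai_tokens_used[24h])) by (component_name, gen_ai_request_model)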
Alert Recommendations: Set budget alerts based on daily or monthly token consumption thresholds. Alert when costs exceed expected values or when usage patterns change unexpectedly.
Gateway Performance
Use sam.gateway.duration and sam.gateway.ttfb.duration metrics to monitor gateway performance:
Key Insights:
- Monitor latency percentiles by endpoint to identify slow operations
- Track request rates by endpoint to understand traffic patterns
- Analyze time-to-first-byte (TTFB) for streaming endpoints to optimize perceived responsiveness
- Calculate error rates by comparing failed requests to total requests
Example Queries:
Prometheus: Gateway P95 latency by endpoint:
histogram_quantile(0.95, sum(rate(sam_gateway_duration_bucket[5m])) by (gateway_name, operation_name, le))
DataDog: Gateway error rate:
sum:sam.gateway.requests{error_type:4xx_error OR error_type:5xx_error}.as_rate() / sum:sam.gateway.requests{*}.as_rate()
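Prometheus: Gateway TTFB P95 for streaming endpoints (a sketch that assumes the sam.gateway.ttfb.duration histogram is exported as sam_gateway_ttfb_duration_bucket, following the same naming convention as the other histograms):
histogram_quantile(0.95, sum(rate(sam_gateway_ttfb_duration_bucket[5m])) by (gateway_name, operation_name, le))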
Alert Recommendations: Alert when gateway latency exceeds SLOs (for example, P95 > 500ms) or when error rates exceed acceptable thresholds (typically 1-5%).
Database Performance
Use the sam.db.duration metric to monitor database operations:
Key Insights:
- Monitor latency percentiles by collection and operation type to identify slow queries
- Track query volume by collection to understand database load patterns (see the example query after this list)
- Identify collections with degraded performance by comparing current latency to historical baselines
- Analyze latency distributions to detect bimodal patterns (for example, cache hits versus misses)
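For example, a Prometheus query for query volume by collection (a sketch; assumes the histogram's request count is exported as sam_db_duration_count, the counterpart of the sam_db_duration_bucket series used elsewhere on this page):
sum(rate(sam_db_duration_count[5m])) by (db_collection_name)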
Alert Recommendations: Alert when P99 database latency exceeds 100ms or when specific collections show sustained performance degradation.
Error Rates and Troubleshooting
Use the error_type label available on most metrics to analyze failures:
Key Insights:
- Calculate overall error rates by dividing failed operations by total operations
- Break down errors by component type, operation, or dependency to identify failure sources (see the example query after this list)
- Track LLM provider errors to detect rate limiting or service degradation
- Monitor dependency errors to identify integration issues with external services
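For example, a Prometheus query that breaks down failed operations by component and error type (a sketch using the sam_operation_duration_count series and the type and error_type labels referenced above):
sum(rate(sam_operation_duration_count{error_type!="none"}[5m])) by (type, component_name, error_type)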
Alert Recommendations: Alert when error rates exceed baseline thresholds (typically 1-5%) or when new error types appear that were not previously observed.
Creating Dashboards and Alerts
After you enable metrics collection and integrate with your observability platform, create dashboards and alerts to monitor Agent Mesh health and performance.
Example Dashboard Panels
This section provides examples for creating dashboard panels in common observability platforms.
Grafana with Prometheus
Panel 1: Agent Latency Percentiles by Component
Query:
histogram_quantile(0.95, sum(rate(sam_operation_duration_bucket{type="agent"}[5m])) by (component_name, le))
Visualization type: Time series
Display: Multiple lines (one per agent)
Threshold markers: Warning at 5s, Critical at 10s
Panel 2: LLM Token Consumption Rate
Query:
sum(rate(sam_gen_ai_tokens_used[5m])) by (gen_ai_request_model, gen_ai_token_type)
Visualization type: Stacked area chart
Legend: Show model and token type
Panel 3: Gateway Error Rate
Query:
sum(rate(sam_gateway_requests{error_type!="none"}[5m])) / sum(rate(sam_gateway_requests[5m]))
Visualization type: Time series
Format: Percentage
Threshold markers: Warning at 1%, Critical at 5%
Panel 4: Database Operation Latency by Collection
Query:
histogram_quantile(0.99, sum(rate(sam_db_duration_bucket[5m])) by (db_collection_name, le))
Visualization type: Bar gauge
Display: Current value per collection
DataDog
Agent Performance Dashboard
- Create a new dashboard in DataDog
- Add a Timeseries widget with query:
p95:sam.operation.duration{type:agent} by {component_name}
- Set visualization to Line graph
- Add monitor threshold overlay at your SLO value
Cost Tracking Dashboard
- Add a Query Value widget with query:
sum:sam.gen_ai.cost.total{*}.rollup(sum, 86400)
- Format as currency
- Add a Timeseries widget showing cost trends over 30 days
Defining Alert Rules
Prometheus Alert Rules
Add these rules to your Prometheus configuration:
```yaml
groups:
  - name: agent_mesh_alerts
    interval: 30s
    rules:
      # Alert when agent P95 latency exceeds 10 seconds
      - alert: HighAgentLatency
        expr: histogram_quantile(0.95, sum(rate(sam_operation_duration_bucket{type="agent"}[5m])) by (component_name, le)) > 10
        for: 5m
        labels:
          severity: warning
          component: agent_mesh
        annotations:
          summary: "Agent {{ $labels.component_name }} P95 latency exceeds 10s"
          description: "Agent {{ $labels.component_name }} has P95 latency of {{ $value }}s, exceeding the 10s threshold."

      # Alert when gateway error rate exceeds 5%
      - alert: HighGatewayErrorRate
        expr: |
          sum(rate(sam_gateway_requests{error_type!="none"}[5m])) by (gateway_name)
          /
          sum(rate(sam_gateway_requests[5m])) by (gateway_name)
          > 0.05
        for: 5m
        labels:
          severity: critical
          component: agent_mesh
        annotations:
          summary: "Gateway {{ $labels.gateway_name }} error rate exceeds 5%"
          description: "Gateway {{ $labels.gateway_name }} has {{ $value | humanizePercentage }} error rate."

      # Alert when LLM latency is abnormally high
      - alert: HighLLMLatency
        expr: histogram_quantile(0.95, sum(rate(sam_gen_ai_client_operation_duration_bucket[5m])) by (gen_ai_request_model, le)) > 30
        for: 10m
        labels:
          severity: warning
          component: agent_mesh
        annotations:
          summary: "LLM {{ $labels.gen_ai_request_model }} P95 latency exceeds 30s"
          description: "Model {{ $labels.gen_ai_request_model }} has P95 latency of {{ $value }}s."

      # Alert when database operations are slow
      - alert: SlowDatabaseOperations
        expr: histogram_quantile(0.99, sum(rate(sam_db_duration_bucket[5m])) by (db_collection_name, le)) > 1.0
        for: 5m
        labels:
          severity: warning
          component: agent_mesh
        annotations:
          summary: "Database collection {{ $labels.db_collection_name }} P99 latency exceeds 1s"
          description: "Collection {{ $labels.db_collection_name }} has P99 latency of {{ $value }}s."

      # Alert when daily LLM costs exceed budget
      - alert: LLMCostBudgetExceeded
        expr: sum(increase(sam_gen_ai_cost_total[24h])) > 100
        labels:
          severity: info
          component: agent_mesh
        annotations:
          summary: "Daily LLM costs exceed $100"
          description: "Total LLM costs in the last 24 hours: ${{ $value }}."
```
DataDog Monitors
Create monitors in DataDog using the metric explorer or API:
High Agent Latency Monitor:
- Metric: p95:sam.operation.duration{type:agent}
- Alert threshold: Above 10 for five minutes
- Group by: component_name
- Notification: Alert when any agent exceeds threshold
Gateway Error Rate Monitor:
- Metric: Custom query combining error and total request rates
- Alert threshold: Above 5% for five minutes
- Group by: gateway_name
Cost Budget Monitor:
- Metric: sum:sam.gen_ai.cost.total{*}.rollup(sum, 86400)
- Alert threshold: Above daily budget value
- Notification: Warning when approaching budget, critical when exceeded
Choosing Alert Thresholds
When defining alert thresholds, consider these factors:
Baseline Performance: Establish baseline metrics from normal operation before setting thresholds. What is typical P95 latency for your agents? What is your normal error rate?
SLO Alignment: Align alert thresholds with your service-level objectives. If you commit to 99.9% uptime, alert before you risk breaching that SLO.
Alert Fatigue: Set thresholds that indicate genuine problems, not normal variance. Use the for: clause in Prometheus rules to require sustained threshold violations before alerting.
Severity Levels: Use multiple severity levels (info, warning, critical) to distinguish between "awareness needed" and "immediate action required."
Best Practices
This section provides guidance on operating Agent Mesh observability in production environments.
Metric Cardinality Management
High cardinality is one of the most common challenges in metrics-based observability. Every unique combination of label values creates a separate time series, and excessive cardinality can lead to increased storage costs and query performance degradation.
Monitor cardinality: Use your observability platform's cardinality analysis tools to understand which metrics and labels contribute most to cardinality. DataDog provides cardinality dashboards that help identify high-cardinality metrics.
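For example, with a Prometheus backend you can list the metric names that contribute the most time series (standard PromQL; assumes the Agent Mesh metrics share the sam_ prefix):
topk(10, count by (__name__)({__name__=~"sam_.*"}))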
Exclude high-cardinality labels: If a label creates excessive cardinality without providing operational value, exclude it using the exclude_labels configuration. Common culprits include user identifiers, session IDs, or request IDs. For more information, see Controlling Metric Cardinality.
Aggregate at query time: Instead of creating metrics with high-cardinality labels, store metrics with lower-cardinality labels and perform aggregation or filtering at query time when needed.
Bucket Configuration for Different Workloads
The default bucket configurations are optimized for typical production workloads but may not match your specific performance characteristics. Review actual latency distributions before deploying to production and adjust buckets accordingly:
Low-latency services: For components with consistently low latency (such as local database operations), use finer-grained buckets in the low-latency range: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1].
High-latency services: For components with naturally higher latency (such as LLM inference or complex workflows), use coarser buckets with higher upper bounds: [1.0, 5.0, 10.0, 30.0, 60.0, 120.0].
Bimodal distributions: If you observe bimodal latency distributions (for example, cache hits versus cache misses), ensure your bucket configuration has sufficient resolution in both ranges.
Iterate based on observation: Start with default buckets and adjust based on actual latency distributions. If most observations fall into a single bucket, you need finer granularity. If many buckets remain empty, you can reduce overhead with coarser granularity.
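To see how observations spread across the current buckets, inspect per-bucket rates in Prometheus (shown here for the agent operation histogram; substitute any other sam_*_bucket series):
sum by (le) (rate(sam_operation_duration_bucket{type="agent"}[1h]))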
For more information about customizing buckets, see Customizing Histogram Buckets.
Troubleshooting
This section addresses common issues you might encounter when configuring or using application metrics.
Metrics Endpoint Not Accessible
Symptom: Accessing the /metrics endpoint returns a connection error or 404.
Possible Causes:
- Observability is not enabled in configuration
- The management server is not running or is configured on a different port
- A firewall or network policy blocks access to the management server port
Resolution:
- Verify that observability.enabled is set to true in your component configuration
- Check the management server configuration and verify the port number
- Use curl from the same host to verify local access: curl http://localhost:8080/metrics
- Check firewall rules and Kubernetes network policies to ensure the port is accessible
No Metrics Appearing
Symptom: The /metrics endpoint is accessible but returns no metrics or only default process metrics.
Possible Causes:
- No operations have been performed yet to generate metrics
- All metric families have been disabled by setting exclude_labels: ['*']
- The component has not been restarted after configuration changes
Resolution:
- Trigger some activity (send requests, invoke agents) to generate metric observations
- Review your distribution_metrics configuration to ensure you have not disabled all metrics
- Restart the component to apply configuration changes
- Check application logs for errors related to metrics initialization
High Cardinality Issues
Symptom: Your observability platform reports high cardinality or shows query performance degradation.
Possible Causes:
- High-cardinality labels are not excluded (such as owner.id or session identifiers)
- Many unique agents, tools, or components create numerous label combinations
- Error types proliferate because of diverse failure modes
Resolution:
- Use your observability platform's cardinality analysis tools to identify problematic labels
- Exclude high-cardinality labels using the exclude_labels configuration
- Consider whether you need per-component granularity or whether aggregating at the type level (agent, tool) is sufficient
- Consolidate error types into broader categories if specific error messages create excessive cardinality
Bucket Configuration Not Applied
Symptom: Custom bucket configurations do not appear in the metrics output; Agent Mesh uses default buckets instead.
Possible Causes:
- Configuration syntax errors prevent Agent Mesh from parsing the custom configuration
- The metric family name is incorrect or does not match the expected format
- The component has not been restarted after configuration changes
Resolution:
- Verify that the metric family name in your configuration matches the documented name (without the sam_ prefix)
- Check application logs for configuration parsing errors
- Restart the component to apply configuration changes
- Use the /metrics endpoint to verify that the expected bucket boundaries appear in the histogram metrics
Missing Labels in Metrics
Symptom: Expected labels are missing from metric output.
Possible Causes:
- Labels have been excluded in the exclude_labels configuration
- The component or operation does not provide that label (for example, not all operations have a db_collection_name)
- Label values are empty or null and Agent Mesh omits them from output
Resolution:
- Review your exclude_labels configuration to ensure you have not inadvertently excluded required labels
- Verify that the label applies to the specific metric family and operation you are observing
- Check application logs for warnings about missing or null label values
Next Steps
After you configure dashboards and alerts, consider these next steps:
- Integrate with incident response: Connect your metrics-based alerts to your incident management system (PagerDuty, Opsgenie, and similar services)
- Optimize cardinality: Review cardinality in your observability platform and exclude high-cardinality labels that do not provide operational value
- Tune bucket configurations: After collecting baseline metrics, adjust histogram buckets to match your specific latency characteristics
- Set up log correlation: Configure OTLP log export to correlate metrics with application logs for troubleshooting
Additional Resources
For more information about Agent Mesh observability and related topics, see:
- Application Metrics with OpenTelemetry: Overview of application metrics, key concepts, and getting started guide
- Configuring OpenTelemetry Metrics: Complete reference for metric families and configuration
- Integrating OpenTelemetry Metrics: OTLP exporter setup and DataDog quick start
- Monitoring Your Agent Mesh: Overview of runtime observability features including activity viewer, broker monitoring, and stimulus logs
- Health Checks: Kubernetes-compatible health check endpoints for liveness, readiness, and startup probes
- Logging Configuration: Application logging configuration and best practices
- OpenTelemetry Documentation: Official OpenTelemetry project documentation