Skip to main content

Use CALM as Your Operations Advisor

🟡 Difficulty: Intermediate | ⏱️ Time: 45-60 minutes

Overview

Use CALM Chat mode as an expert operations advisor to troubleshoot issues in your e-commerce platform. With rich metadata, your architecture becomes living support documentation.

Learning Objectives

By the end of this tutorial, you will:

  • Add operational metadata (owners, health endpoints, failure modes, runbooks) to architecture nodes
  • Document flow-level business impact and SLAs
  • Add monitoring metadata at the architecture level
  • Use CALM Chat mode to troubleshoot three simulated outage scenarios
  • Understand why architecture-as-documentation outperforms static wikis

Prerequisites

Complete Use CALM as Your Architecture Advisor first.

Step-by-Step Guide

1. Understand Architecture as Operations Documentation

Your CALM architecture already contains:

  • Nodes: What services exist, their types, and criticality
  • Relationships: How services connect and depend on each other
  • Flows: Business processes that traverse your system
  • Controls: SLAs and compliance requirements

By adding operational metadata, you transform this into queryable support documentation:

  • Ownership and escalation contacts
  • Health check endpoints
  • Common failure modes and remediation steps
  • Runbook links and troubleshooting guides

Unlike static wikis, this documentation lives with the architecture and stays in sync.

Side note: At the moment we are putting a lot of information into the metadata section, but there are more advanced techniques that can be used to do this in a more structured way.

2. Add Operational Metadata to Nodes

Open your architectures/ecommerce-platform.json.

Prompt:

Add operational metadata to my e-commerce architecture nodes. For each service (load-balancer, api-gateway-1, api-gateway-2, order-service, inventory-service, payment-service), add metadata including:

1. "owner": team name responsible (e.g., "platform-team", "payments-team")
2. "oncall-slack": Slack channel for incidents (e.g., "#oncall-platform")
3. "health-endpoint": Health check URL (e.g., "/health" or "/actuator/health")
4. "runbook": Link to runbook (e.g., "https://runbooks.example.com/order-service")
5. "tier": Service tier for prioritization ("tier-1", "tier-2", "tier-3")
6. "dependencies": Array of critical upstream/downstream services

Also add to the databases:
1. "backup-schedule": When backups run
2. "restore-time": Expected restore duration
3. "dba-contact": DBA team contact

3. Add Failure Mode Metadata

Prompt:

Add failure mode documentation to the order-service node metadata:

"failure-modes": [
{
"symptom": "HTTP 503 errors",
"likely-cause": "Database connection pool exhausted",
"check": "Check connection pool metrics in Grafana dashboard",
"remediation": "Scale up service replicas or increase pool size",
"escalation": "If persists > 5min, page DBA team"
},
{
"symptom": "High latency (>2s p99)",
"likely-cause": "Payment service degradation",
"check": "Check payment-service health and circuit breaker status",
"remediation": "Circuit breaker should open automatically; check fallback queue",
"escalation": "Contact payments-team if circuit breaker not triggering"
},
{
"symptom": "Order validation failures",
"likely-cause": "Inventory service returning stale data",
"check": "Verify inventory-service cache TTL and database replication lag",
"remediation": "Clear inventory cache; check replica sync status",
"escalation": "Contact platform-team for cache issues"
}
]

Add similar failure modes for payment-service and inventory-service.

4. Add Flow-Level Incident Metadata

Prompt:

Add incident metadata to my business flows:

For order-processing-flow:
- "business-impact": "Customers cannot complete purchases - direct revenue loss"
- "degraded-behavior": "Orders queue in message broker; processed when service recovers"
- "customer-communication": "Display 'Order processing delayed' message"
- "sla": "99.9% availability, 30s p99 latency"

For inventory-check-flow:
- "business-impact": "Stock levels may be inaccurate - risk of overselling"
- "degraded-behavior": "Fall back to cached inventory; flag orders for manual review"
- "customer-communication": "Display 'Stock availability being confirmed'"
- "sla": "99.5% availability, 500ms p99 latency"

5. Add Monitoring and Alerting Metadata

Prompt:

Add monitoring metadata to the architecture level:

"monitoring": {
"grafana-dashboard": "https://grafana.example.com/d/ecommerce-overview",
"kibana-logs": "https://kibana.example.com/app/discover#/ecommerce-*",
"pagerduty-service": "https://pagerduty.example.com/services/ECOMMERCE",
"statuspage": "https://status.example.com",
"metrics-retention": "30 days",
"log-retention": "90 days"
}

Add service-specific dashboards in each node's metadata:
- "dashboard": Link to service-specific Grafana dashboard
- "log-query": Pre-built Kibana query for this service
- "alerts": Array of PagerDuty alert names that fire for this service

6. Validate the Enriched Architecture

calm validate -a architectures/ecommerce-platform.json

Metadata doesn't affect validation — it passes through as documentation. ✅

7. Simulate an Outage: Payment Service Down

Scenario: Order completion rate has dropped 80%. Customers can't checkout.

Prompt:

I'm receiving alerts that order completion rate has dropped 80%. Customers report checkout failures.

Based on my e-commerce architecture:
1. What services are involved in the checkout/order flow?
2. What are the most likely failure points?
3. What health endpoints should I check first?
4. Who should I contact if this is a payment issue?
5. What's the business impact and customer communication plan?

CALM should respond using your architecture's metadata — identifying the order-processing-flow, services involved, health endpoints, and escalation contacts.

8. Simulate an Outage: Database Latency Spike

Scenario: Order-service latency has spiked to 5 seconds. No errors, just slow.

Prompt:

Order-service latency has spiked to 5 seconds (normally <200ms). No errors in logs.

Using my architecture:
1. What databases does order-service connect to?
2. Could this be a database replication issue?
3. What are the known failure modes for high latency?
4. What are the remediation steps?
5. What's the DBA contact for the order database?

9. Simulate an Outage: Cascade Failure Investigation

Scenario: Multiple services are showing errors. You need to find the root cause.

Prompt:

I'm seeing errors across order-service, inventory-service, and the web-frontend. It started 10 minutes ago.

Analyze my architecture to help identify the root cause:
1. What's the dependency graph between these services?
2. What shared infrastructure could cause all three to fail?
3. Is there a single point of failure that could explain this?
4. In what order should I investigate?
5. Based on the flow definitions, which business processes are affected?

CALM should identify the API Gateway or load balancer as the likely shared failure point.

Use git to lock in the progress you've made in this lesson before continuing.

Key Concepts

Operational Metadata Properties

PropertyLevelPurpose
ownerNodeTeam responsible
oncall-slackNodeIncident escalation channel
health-endpointNodeHealth check URL
runbookNodeTroubleshooting guide link
tierNodePriority classification (tier-1 to tier-3)
failure-modesNodeArray of known failure scenarios
business-impactFlowRevenue/operations impact statement
slaFlowAvailability and latency targets
degraded-behaviorFlowHow the system behaves under partial failure

Wiki vs Architecture-as-Documentation

AspectTraditional WikiCALM Architecture
AccuracyOften staleAlways current (lives with code)
DiscoverabilitySearch/hopeQuery the model directly
DependenciesManually maintainedDerived from relationships
Impact analysisTribal knowledgeComputed from flows
Escalation pathsBuried in pagesEmbedded in node metadata
TroubleshootingStatic runbooksContext-aware AI assistance

Resources

Next Steps

In the next tutorial, you'll use docify to generate operations documentation — runbooks, on-call guides, and flow support guides — directly from your architecture metadata!