Use CALM as Your Operations Advisor

🟡 Difficulty: Intermediate | ⏱️ Time: 45-60 minutes

Overview

Use CALM Chat mode as an expert operations advisor to troubleshoot issues in your e-commerce platform. With rich metadata, your architecture becomes living support documentation.

Learning Objectives

By the end of this tutorial, you will:

Add operational metadata (owners, health endpoints, failure modes, runbooks) to architecture nodes
Document flow-level business impact and SLAs
Add monitoring metadata at the architecture level
Use CALM Chat mode to troubleshoot three simulated outage scenarios
Understand why architecture-as-documentation outperforms static wikis

Prerequisites

Complete the "Use CALM as Your Architecture Advisor" tutorial first.

Step-by-Step Guide

1. Understand Architecture as Operations Documentation

Your CALM architecture already contains:

Nodes: What services exist, their types, and criticality
Relationships: How services connect and depend on each other
Flows: Business processes that traverse your system
Controls: SLAs and compliance requirements

By adding operational metadata, you transform this into queryable support documentation:

Ownership and escalation contacts
Health check endpoints
Common failure modes and remediation steps
Runbook links and troubleshooting guides

Unlike static wikis, this documentation lives with the architecture and stays in sync.

Side note: This tutorial places a lot of information in the free-form metadata section. More structured techniques exist for capturing the same information.

2. Add Operational Metadata to Nodes

Open architectures/ecommerce-platform.json.

Prompt:

Add operational metadata to my e-commerce architecture nodes. For each service (load-balancer, api-gateway-1, api-gateway-2, order-service, inventory-service, payment-service), add metadata including:

"owner": team name responsible (e.g., "platform-team", "payments-team")
"oncall-slack": Slack channel for incidents (e.g., "#oncall-platform")
"health-endpoint": Health check URL (e.g., "/health" or "/actuator/health")
"runbook": Link to runbook (e.g., "https://runbooks.example.com/order-service")
"tier": Service tier for prioritization ("tier-1", "tier-2", "tier-3")
"dependencies": Array of critical upstream/downstream services

Also add the following to the database nodes:
"backup-schedule": When backups run
"restore-time": Expected restore duration
"dba-contact": DBA team contact

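To make the request above concrete, here is a minimal Python sketch of what one enriched node might look like and how you could check it for missing fields. The node shape (a `unique-id` plus a flat `metadata` object), the sample values, and the helper `missing_operational_keys` are assumptions for illustration; what CALM Chat actually writes into your file may be structured differently.

```python
# Hypothetical example: the node layout and all values are illustrative
# assumptions, not output copied from CALM.
REQUIRED_SERVICE_KEYS = [
    "owner",
    "oncall-slack",
    "health-endpoint",
    "runbook",
    "tier",
    "dependencies",
]

order_service = {
    "unique-id": "order-service",
    "metadata": {
        "owner": "platform-team",
        "oncall-slack": "#oncall-platform",
        "health-endpoint": "/actuator/health",
        "runbook": "https://runbooks.example.com/order-service",
        "tier": "tier-1",
        "dependencies": ["inventory-service", "payment-service"],
    },
}


def missing_operational_keys(node, required=REQUIRED_SERVICE_KEYS):
    """Return the required metadata keys that the node does not define."""
    metadata = node.get("metadata", {})
    return [key for key in required if key not in metadata]


if __name__ == "__main__":
    # An empty list means every expected operational field is present.
    print(missing_operational_keys(order_service))
```

Running a check like this over every service node is one way to spot a service that was skipped when the metadata was generated.
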
3. Add Failure Mode Metadata

Prompt:

Add failure mode documentation to the order-service node metadata:

"failure-modes": [
  {
    "symptom": "HTTP 503 errors",
    "likely-cause": "Database connection pool exhausted",
    "check": "Check connection pool metrics in Grafana dashboard",
    "remediation": "Scale up service replicas or increase pool size",
    "escalation": "If persists > 5min, page DBA team"
  },
  {
    "symptom": "High latency (>2s p99)",
    "likely-cause": "Payment service degradation",
    "check": "Check payment-service health and circuit breaker status",
    "remediation": "Circuit breaker should open automatically; check fallback queue",
    "escalation": "Contact payments-team if circuit breaker not triggering"
  },
  {
    "symptom": "Order validation failures",
    "likely-cause": "Inventory service returning stale data",
    "check": "Verify inventory-service cache TTL and database replication lag",
    "remediation": "Clear inventory cache; check replica sync status",
    "escalation": "Contact platform-team for cache issues"
  }
]

Add similar failure modes for payment-service and inventory-service.
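
Each failure-mode entry pairs a symptom with a likely cause and a remediation, which makes the list easy to query. The sketch below assumes the entries sit under a `failure-modes` key in the node's metadata. The helper `find_failure_modes` is hypothetical, and its plain substring matching is only a stand-in for the natural-language matching CALM Chat performs.

```python
# Hypothetical helper; the entries mirror two of the order-service
# failure modes from the prompt above.
order_service_metadata = {
    "failure-modes": [
        {
            "symptom": "HTTP 503 errors",
            "likely-cause": "Database connection pool exhausted",
            "remediation": "Scale up service replicas or increase pool size",
            "escalation": "If persists > 5min, page DBA team",
        },
        {
            "symptom": "High latency (>2s p99)",
            "likely-cause": "Payment service degradation",
            "remediation": "Circuit breaker should open automatically; check fallback queue",
            "escalation": "Contact payments-team if circuit breaker not triggering",
        },
    ]
}


def find_failure_modes(metadata, keyword):
    """Return failure modes whose symptom contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        mode
        for mode in metadata.get("failure-modes", [])
        if needle in mode["symptom"].lower()
    ]


if __name__ == "__main__":
    for mode in find_failure_modes(order_service_metadata, "latency"):
        print(mode["likely-cause"], "->", mode["remediation"])
```
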

4. Add Flow-Level Incident Metadata

Prompt:

Add incident metadata to my business flows:

For order-processing-flow:
- "business-impact": "Customers cannot complete purchases - direct revenue loss"
- "degraded-behavior": "Orders queue in message broker; processed when service recovers"
- "customer-communication": "Display 'Order processing delayed' message"
- "sla": "99.9% availability, 30s p99 latency"

For inventory-check-flow:
- "business-impact": "Stock levels may be inaccurate - risk of overselling"
- "degraded-behavior": "Fall back to cached inventory; flag orders for manual review"
- "customer-communication": "Display 'Stock availability being confirmed'"
- "sla": "99.5% availability, 500ms p99 latency"
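
An availability target is easier to reason about during an incident when it is expressed as a downtime budget. The sketch below parses the availability figure out of an `sla` string like the ones above and converts it to minutes of allowed downtime. The 30-day window and both function names are assumptions; the SLA strings in this tutorial do not state a measurement window.

```python
import re

# Assumed measurement window: 30 days, expressed in minutes (43,200).
MINUTES_PER_30_DAYS = 30 * 24 * 60


def availability_percent(sla_text):
    """Extract the availability percentage from text like '99.9% availability, ...'."""
    match = re.search(r"(\d+(?:\.\d+)?)%\s*availability", sla_text)
    if match is None:
        raise ValueError(f"no availability target found in {sla_text!r}")
    return float(match.group(1))


def downtime_budget_minutes(sla_text, window_minutes=MINUTES_PER_30_DAYS):
    """Return the downtime (in minutes) the availability target allows in the window."""
    unavailable_fraction = (100.0 - availability_percent(sla_text)) / 100.0
    return round(window_minutes * unavailable_fraction, 1)


if __name__ == "__main__":
    print(downtime_budget_minutes("99.9% availability, 30s p99 latency"))
    print(downtime_budget_minutes("99.5% availability, 500ms p99 latency"))
```

Under the 30-day assumption, 99.9% availability allows about 43 minutes of downtime and 99.5% allows 216 minutes. That difference is a useful input when deciding which flow to restore first.
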

5. Add Monitoring and Alerting Metadata

Prompt:

Add monitoring metadata to the architecture level:

"monitoring": {
  "grafana-dashboard": "https://grafana.example.com/d/ecommerce-overview",
  "kibana-logs": "https://kibana.example.com/app/discover#/ecommerce-*",
  "pagerduty-service": "https://pagerduty.example.com/services/ECOMMERCE",
  "statuspage": "https://status.example.com",
  "metrics-retention": "30 days",
  "log-retention": "90 days"
}

Add service-specific dashboards in each node's metadata:
- "dashboard": Link to service-specific Grafana dashboard
- "log-query": Pre-built Kibana query for this service
- "alerts": Array of PagerDuty alert names that fire for this service

6. Validate the Enriched Architecture

calm validate -a architectures/ecommerce-platform.json

Metadata doesn't affect validation; it passes through as documentation. ✅
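
Because validation does not inspect metadata, a typo in a dashboard link would go unnoticed until someone needs it. A small lint step of your own can cover that gap. The sketch below checks that the link-valued monitoring entries are absolute http(s) URLs. The `LINK_KEYS` list and the `invalid_links` helper are assumptions for illustration and are not part of the CALM CLI.

```python
from urllib.parse import urlparse

# Monitoring block as described in step 5.
monitoring = {
    "grafana-dashboard": "https://grafana.example.com/d/ecommerce-overview",
    "kibana-logs": "https://kibana.example.com/app/discover#/ecommerce-*",
    "pagerduty-service": "https://pagerduty.example.com/services/ECOMMERCE",
    "statuspage": "https://status.example.com",
    "metrics-retention": "30 days",
    "log-retention": "90 days",
}

# Assumed set of keys whose values should be links.
LINK_KEYS = ["grafana-dashboard", "kibana-logs", "pagerduty-service", "statuspage"]


def invalid_links(metadata, link_keys=LINK_KEYS):
    """Return the link keys that are missing or are not absolute http(s) URLs."""
    problems = []
    for key in link_keys:
        parsed = urlparse(metadata.get(key, ""))
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(key)
    return problems


if __name__ == "__main__":
    # An empty list means every expected link is present and well-formed.
    print(invalid_links(monitoring))
```
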

7. Simulate an Outage: Payment Service Down

Scenario: Order completion rate has dropped 80%. Customers can't check out.

Prompt:

I'm receiving alerts that order completion rate has dropped 80%. Customers report checkout failures.

Based on my e-commerce architecture:
What services are involved in the checkout/order flow?
What are the most likely failure points?
What health endpoints should I check first?
Who should I contact if this is a payment issue?
What's the business impact and customer communication plan?

CALM should respond using your architecture's metadata — identifying the order-processing-flow, services involved, health endpoints, and escalation contacts.
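
The same lookup can be approximated by hand, which helps when sanity-checking the answer. The sketch below resolves a flow's transitions to relationships, then to nodes, and lists each node's health endpoint and on-call channel. The trimmed-down architecture is hypothetical. The field names (`transitions`, `relationship-unique-id`, `sequence-number`, `connects`) reflect the CALM layout as assumed here and should be checked against your own file.

```python
# Hypothetical, trimmed-down architecture; all field names and values are
# assumptions for illustration.
architecture = {
    "nodes": [
        {"unique-id": "api-gateway-1",
         "metadata": {"oncall-slack": "#oncall-platform", "health-endpoint": "/health"}},
        {"unique-id": "order-service",
         "metadata": {"oncall-slack": "#oncall-platform", "health-endpoint": "/actuator/health"}},
        {"unique-id": "payment-service",
         "metadata": {"oncall-slack": "#oncall-payments", "health-endpoint": "/health"}},
    ],
    "relationships": [
        {"unique-id": "gateway-to-order",
         "relationship-type": {"connects": {
             "source": {"node": "api-gateway-1"},
             "destination": {"node": "order-service"}}}},
        {"unique-id": "order-to-payment",
         "relationship-type": {"connects": {
             "source": {"node": "order-service"},
             "destination": {"node": "payment-service"}}}},
    ],
    "flows": [
        {"unique-id": "order-processing-flow",
         "transitions": [
             {"relationship-unique-id": "gateway-to-order", "sequence-number": 1},
             {"relationship-unique-id": "order-to-payment", "sequence-number": 2},
         ]},
    ],
}


def flow_triage(arch, flow_id):
    """List (node id, health endpoint, on-call channel) for each node a flow touches, in flow order."""
    nodes = {node["unique-id"]: node for node in arch["nodes"]}
    relationships = {rel["unique-id"]: rel for rel in arch["relationships"]}
    flow = next(f for f in arch["flows"] if f["unique-id"] == flow_id)

    ordered_ids = []
    for transition in sorted(flow["transitions"], key=lambda t: t["sequence-number"]):
        connects = relationships[transition["relationship-unique-id"]]["relationship-type"]["connects"]
        for node_id in (connects["source"]["node"], connects["destination"]["node"]):
            if node_id not in ordered_ids:
                ordered_ids.append(node_id)

    return [
        (node_id,
         nodes[node_id]["metadata"].get("health-endpoint"),
         nodes[node_id]["metadata"].get("oncall-slack"))
        for node_id in ordered_ids
    ]


if __name__ == "__main__":
    for row in flow_triage(architecture, "order-processing-flow"):
        print(row)
```
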

8. Simulate an Outage: Database Latency Spike

Scenario: Order-service latency has spiked to 5 seconds. No errors, just slow.

Prompt:

Order-service latency has spiked to 5 seconds (normally <200ms). No errors in logs.

Using my architecture:
What databases does order-service connect to?
Could this be a database replication issue?
What are the known failure modes for high latency?
What are the remediation steps?
What's the DBA contact for the order database?

9. Simulate an Outage: Cascade Failure Investigation

Scenario: Multiple services are showing errors. You need to find the root cause.

Prompt:

I'm seeing errors across order-service, inventory-service, and the web-frontend. It started 10 minutes ago.

Analyze my architecture to help identify the root cause:
What's the dependency graph between these services?
What shared infrastructure could cause all three to fail?
Is there a single point of failure that could explain this?
In what order should I investigate?
Based on the flow definitions, which business processes are affected?

CALM should identify the API Gateway or load balancer as the likely shared failure point.
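
The reasoning behind that answer can be sketched as a graph intersection: for each affected service, collect every node on a call path to or from it, then keep the nodes common to all of them. The edge list below is a hypothetical "source calls destination" view of the platform. The database names `order-db` and `inventory-db` and both helper functions are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical call edges (source, destination); your relationships may differ.
EDGES = [
    ("web-frontend", "load-balancer"),
    ("load-balancer", "api-gateway-1"),
    ("load-balancer", "api-gateway-2"),
    ("api-gateway-1", "order-service"),
    ("api-gateway-2", "order-service"),
    ("api-gateway-1", "inventory-service"),
    ("api-gateway-2", "inventory-service"),
    ("order-service", "payment-service"),
    ("order-service", "order-db"),
    ("inventory-service", "inventory-db"),
]


def reachable(start, adjacency):
    """Return every node reachable from start by following adjacency (start excluded)."""
    seen = set()
    stack = [start]
    while stack:
        current = stack.pop()
        for neighbour in adjacency[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    seen.discard(start)
    return seen


def shared_suspects(edges, affected):
    """Return nodes on a call path to or from every affected service, excluding the affected services."""
    if not affected:
        return []
    downstream = defaultdict(set)
    upstream = defaultdict(set)
    for source, destination in edges:
        downstream[source].add(destination)
        upstream[destination].add(source)
    related = [
        reachable(service, downstream) | reachable(service, upstream)
        for service in affected
    ]
    return sorted(set.intersection(*related) - set(affected))


if __name__ == "__main__":
    print(shared_suspects(EDGES, ["order-service", "inventory-service", "web-frontend"]))
```

With these assumed edges, the only nodes shared by all three affected services are the load balancer and the two API gateways, which is why they are the first places to investigate.
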

Use git to lock in the progress you've made in this lesson before continuing.

Key Concepts

Operational Metadata Properties

Property	Level	Purpose
`owner`	Node	Team responsible
`oncall-slack`	Node	Incident escalation channel
`health-endpoint`	Node	Health check URL
`runbook`	Node	Troubleshooting guide link
`tier`	Node	Priority classification (tier-1 to tier-3)
`failure-modes`	Node	Array of known failure scenarios
`business-impact`	Flow	Revenue/operations impact statement
`sla`	Flow	Availability and latency targets
`degraded-behavior`	Flow	How the system behaves under partial failure

Wiki vs Architecture-as-Documentation

Aspect	Traditional Wiki	CALM Architecture
Accuracy	Often stale	Stays current (versioned with the code)
Discoverability	Search/hope	Query the model directly
Dependencies	Manually maintained	Derived from relationships
Impact analysis	Tribal knowledge	Computed from flows
Escalation paths	Buried in pages	Embedded in node metadata
Troubleshooting	Static runbooks	Context-aware AI assistance

Resources

Next Steps

In the next tutorial, you'll use docify to generate operations documentation — runbooks, on-call guides, and flow support guides — directly from your architecture metadata!
