AWS CloudWatch: Production Observability Platform
AWS CloudWatch brings the three pillars of observability (metrics, logs, and traces) into a single managed platform. Instead of stitching together multiple tools, teams get an integrated experience for monitoring, alerting, and troubleshooting AWS workloads, without the operational overhead of self-managed monitoring infrastructure.
Effective observability goes beyond simple uptime monitoring. It requires understanding system behavior through custom metrics, structured logging, distributed tracing, and alerting that reduces noise. This guide covers practical patterns for building CloudWatch-based observability that helps you detect, diagnose, and resolve issues faster.
AWS CloudWatch Observability: Custom Metrics
While CloudWatch provides built-in metrics for AWS services, custom metrics capture application-specific behavior such as request latency percentiles, business KPIs, queue depths, and error rates. The Embedded Metric Format (EMF) lets you publish these metrics as structured log events, so a single log line yields both a queryable log record and CloudWatch metrics.
// Embedded Metric Format: publish metrics through logs
import org.springframework.stereotype.Service;
import software.amazon.cloudwatchlogs.emf.logger.MetricsLogger;
import software.amazon.cloudwatchlogs.emf.model.DimensionSet;
import software.amazon.cloudwatchlogs.emf.model.Unit;

@Service
public class OrderMetrics {

    private final MetricsLogger metricsLogger;

    // Assumes a MetricsLogger bean is configured elsewhere; the final field must be set here.
    public OrderMetrics(MetricsLogger metricsLogger) {
        this.metricsLogger = metricsLogger;
    }

    public void recordOrderProcessed(Order order, long durationMs) {
        // Dimensions define the metric series, so keep them low-cardinality.
        metricsLogger.putDimensions(
            DimensionSet.of("Service", "OrderService", "Environment", "production")
        );
        metricsLogger.putMetric("OrderProcessingTime", durationMs, Unit.MILLISECONDS);
        metricsLogger.putMetric("OrderAmount", order.getTotal().doubleValue(), Unit.NONE);
        metricsLogger.putMetric("OrderCount", 1, Unit.COUNT);
        // Properties are searchable in Logs Insights but do not create new metrics.
        metricsLogger.putProperty("orderId", order.getId());
        metricsLogger.putProperty("customerId", order.getCustomerId());
        metricsLogger.putProperty("status", order.getStatus().name());
        // Emit the accumulated metrics and properties as one EMF log event.
        metricsLogger.flush();
    }
}
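To make the "metrics through log events" idea concrete, the sketch below hand-builds the kind of JSON line an EMF library emits: an `_aws` metadata object that names the namespace, dimension keys, and metric definitions, followed by the dimension values, metric values, and properties as top-level members. The class `EmfEnvelopeSketch` and its `render` method are hypothetical helpers written for illustration, and the namespace `MyApp` is borrowed from the dashboard example; in practice the EMF client library produces this envelope for you.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of any AWS library): renders one EMF-style log line.
final class EmfEnvelopeSketch {

    /** Builds a single-metric EMF event with string dimensions and string properties. */
    static String render(String namespace, long timestampMs,
                         Map<String, String> dimensions,
                         String metricName, String unit, double value,
                         Map<String, String> properties) {
        // The metadata lists dimension KEYS only; the values live at the top level.
        StringBuilder dimKeys = new StringBuilder();
        for (String key : dimensions.keySet()) {
            if (dimKeys.length() > 0) dimKeys.append(',');
            dimKeys.append(quote(key));
        }
        StringBuilder json = new StringBuilder();
        json.append("{\"_aws\":{\"Timestamp\":").append(timestampMs)
            .append(",\"CloudWatchMetrics\":[{\"Namespace\":").append(quote(namespace))
            .append(",\"Dimensions\":[[").append(dimKeys).append("]]")
            .append(",\"Metrics\":[{\"Name\":").append(quote(metricName))
            .append(",\"Unit\":").append(quote(unit)).append("}]}]}");
        // Dimension values, the metric value, and properties are plain top-level members.
        dimensions.forEach((k, v) -> json.append(',').append(quote(k)).append(':').append(quote(v)));
        json.append(',').append(quote(metricName)).append(':').append(value);
        properties.forEach((k, v) -> json.append(',').append(quote(k)).append(':').append(quote(v)));
        return json.append('}').toString();
    }

    // Minimal JSON string escaping; enough for this illustration only.
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        Map<String, String> dims = new LinkedHashMap<>();
        dims.put("Service", "OrderService");
        dims.put("Environment", "production");
        System.out.println(render("MyApp", System.currentTimeMillis(), dims,
                "OrderProcessingTime", "Milliseconds", 182.0, Map.of("orderId", "o-1001")));
    }
}
```

Note that `orderId` appears only as a top-level member and is never listed under `Dimensions`, which is why it stays a searchable property rather than becoming part of the metric's identity.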
// CloudWatch metric math for derived metrics
// Error rate = errors / (errors + successes) * 100
// Dashboard widget JSON:
// {
//   "metrics": [
//     [ { "expression": "m1/(m1+m2)*100", "label": "Error Rate %", "id": "e1" } ],
//     [ "MyApp", "ErrorCount", "Service", "OrderService", { "id": "m1", "visible": false } ],
//     [ "MyApp", "SuccessCount", "Service", "OrderService", { "id": "m2", "visible": false } ]
//   ]
// }
Cardinality and the Cost of Custom Metrics
The single biggest trap teams fall into is metric cardinality. CloudWatch creates a distinct custom metric for every unique combination of namespace, metric name, and dimension values — and each one is billed individually. Therefore, adding a high-cardinality dimension such as customerId or requestId to a metric can explode a handful of metrics into millions overnight, with a bill to match.
The correct pattern, which EMF encourages, is to keep dimensions low-cardinality (service, environment, region, endpoint) and push identifiers like orderId into properties instead. Properties are searchable in Logs Insights but do not create new metric series. Furthermore, watch the metric resolution: standard metrics aggregate at one-minute granularity, whereas high-resolution metrics record per-second data at a higher price. Most application KPIs do not need sub-minute resolution, so reserve it for latency-sensitive systems where a one-minute average would hide a spike.
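The arithmetic behind the cardinality warning is easy to sketch. The snippet below estimates the worst-case number of distinct series for one metric name as the product of the distinct values each dimension can take. The class name and the counts (3 environments, 20 endpoints, 50,000 customers) are made-up illustrative figures, not measurements, and the real number is the count of combinations actually observed, which can be lower.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical estimator: upper bound on metric series for a single metric name.
final class MetricCardinalitySketch {

    /** Multiplies the number of distinct values of every dimension. */
    static long worstCaseSeries(Map<String, Long> distinctValuesPerDimension) {
        long series = 1;
        for (long distinct : distinctValuesPerDimension.values()) {
            series = Math.multiplyExact(series, distinct); // throws on overflow
        }
        return series;
    }

    public static void main(String[] args) {
        Map<String, Long> dims = new LinkedHashMap<>();
        dims.put("Service", 1L);
        dims.put("Environment", 3L);
        dims.put("Endpoint", 20L);
        System.out.println(worstCaseSeries(dims)); // prints 60

        // Adding one high-cardinality dimension multiplies everything.
        dims.put("customerId", 50_000L);
        System.out.println(worstCaseSeries(dims)); // prints 3000000
    }
}
```

Under these assumed counts, moving `customerId` from a dimension to a property keeps the series count at 60 instead of three million.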
Structured Logging with CloudWatch Logs Insights
Structured JSON logging enables powerful querying with CloudWatch Logs Insights. Instead of searching through unstructured text, query specific fields, aggregate values, and visualize trends. Additionally, Logs Insights queries can be saved and added to dashboards for operational visibility.
# CloudWatch Logs Insights: Find slowest API endpoints
fields @timestamp, @message
| filter ispresent(duration) and duration > 1000
| stats avg(duration) as avg_ms, max(duration) as max_ms,
        count(*) as request_count
  by endpoint
| sort avg_ms desc
| limit 20
# Error analysis by type and service
fields @timestamp, level, message, errorType, service
| filter level = "ERROR"
| stats count(*) as error_count by errorType, service
| sort error_count desc
# P99 latency trend over time
fields @timestamp, duration
| filter ispresent(duration)
| stats percentile(duration, 99) as p99,
        percentile(duration, 95) as p95,
        percentile(duration, 50) as p50
  by bin(5m)
| sort @timestamp asc
One detail that pays off quickly is consistent field naming across services. Because Logs Insights queries reference fields by name, a query that aggregates duration in one service and latency_ms in another forces you to maintain two queries. A shared logging library that emits a fixed schema (level, service, traceId, duration) lets one saved query work fleet-wide. Including the X-Ray traceId in every log line is also what lets you pivot from a slow request in the logs straight to its trace.
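A shared logging helper with a fixed schema can be very small. The sketch below is one hypothetical shape for it: the class `StructuredLogSketch` and its `logLine` method are illustrative names, and a real service would more likely configure a JSON encoder in its logging framework. The field names match the ones the queries above filter and aggregate on.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical shared helper: every service emits the same field names.
final class StructuredLogSketch {

    /** Renders one JSON log line with a fixed, ordered set of fields. */
    static String logLine(String level, String service, String traceId,
                          String endpoint, long durationMs, String message) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("level", level);
        fields.put("service", service);
        fields.put("traceId", traceId);   // lets a log line be matched to its trace
        fields.put("endpoint", endpoint);
        fields.put("duration", durationMs); // numeric, so stats functions can aggregate it
        fields.put("message", message);

        StringBuilder json = new StringBuilder("{");
        for (Map.Entry<String, Object> e : fields.entrySet()) {
            if (json.length() > 1) json.append(',');
            json.append(quote(e.getKey())).append(':');
            Object v = e.getValue();
            json.append(v instanceof Number ? v.toString() : quote(String.valueOf(v)));
        }
        return json.append('}').toString();
    }

    // Minimal JSON string escaping; enough for this illustration only.
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(logLine("ERROR", "OrderService",
                "1-5759e988-bd862e3fe1be46a994272793", "/orders", 1250, "payment timeout"));
    }
}
```

Keeping `duration` as a bare number rather than a string such as "1250ms" is the design choice that matters here, since numeric aggregation depends on it.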
Composite Alarms and Anomaly Detection
Composite alarms combine multiple alarm states to reduce noise. Instead of alerting on every metric threshold breach, composite alarms trigger only when multiple conditions indicate a real problem. Furthermore, CloudWatch Anomaly Detection uses ML to establish baselines and alert on deviations rather than on a fixed number you guessed at deploy time.
Consider an API where a brief latency blip is normal but sustained high latency combined with rising errors is not. A composite alarm expresses exactly that joint condition, so the page only fires when both signals agree:
{
  "AlarmName": "OrderService-Degraded",
  "AlarmRule": "ALARM(\"OrderService-HighLatency\") AND ALARM(\"OrderService-HighErrorRate\")",
  "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:pagerduty-critical"],
  "ActionsSuppressor": "OrderService-Deploying",
  "ActionsSuppressorWaitPeriod": 300
}
The ActionsSuppressor field is underused and valuable: it suppresses the composite alarm's actions while the suppressor alarm (here, one that is in ALARM during a deployment) is active, which kills the flood of false pages that every rollout otherwise generates. For anomaly detection, attach a band to a metric and alarm when values fall outside it; this catches a drop to zero traffic, a real outage that a “greater-than” threshold would happily ignore.
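The zero-traffic case is worth seeing in numbers. The sketch below uses a deliberately simple stand-in for the band, the mean of recent values plus or minus a multiple of their standard deviation. This is an assumption made for illustration only: CloudWatch's anomaly detection trains its own model, and the class, method names, and sample request counts here are hypothetical.

```java
// Hypothetical illustration: a band-based check versus a fixed upper threshold.
final class AnomalyBandSketch {

    /** Returns {lower, upper} as mean +/- width * population standard deviation. */
    static double[] band(double[] history, double width) {
        double sum = 0;
        for (double v : history) sum += v;
        double mean = sum / history.length;

        double squaredDiffs = 0;
        for (double v : history) squaredDiffs += (v - mean) * (v - mean);
        double stdDev = Math.sqrt(squaredDiffs / history.length);

        return new double[] { mean - width * stdDev, mean + width * stdDev };
    }

    /** True when the value falls below the lower edge or above the upper edge. */
    static boolean outsideBand(double value, double[] band) {
        return value < band[0] || value > band[1];
    }

    /** A classic "greater-than" alarm condition. */
    static boolean aboveThreshold(double value, double threshold) {
        return value > threshold;
    }

    public static void main(String[] args) {
        double[] requestsPerMinute = { 100, 110, 90, 105, 95 }; // made-up history
        double[] band = band(requestsPerMinute, 2.0);

        double outageValue = 0; // traffic has dropped to zero
        System.out.println(outsideBand(outageValue, band));   // prints true
        System.out.println(aboveThreshold(outageValue, 500)); // prints false
    }
}
```

The fixed threshold stays quiet because zero is well below 500, while the band check fires because zero is far below the lower edge of normal traffic.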
X-Ray Distributed Tracing
CloudWatch integrates with AWS X-Ray for distributed tracing across Lambda, ECS, EC2, and API Gateway. Traces show the complete request path with timing for each service hop, making it easy to identify bottlenecks. Importantly, X-Ray samples rather than capturing every request — the default rule traces one request per second plus five percent of the remainder — so you keep visibility into representative behavior without paying to store every trace.
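The default sampling rule described above translates into simple arithmetic. The sketch below computes the expected number of traced requests in one second for a given request rate, using a reservoir of 1 and a fixed rate of 5 percent. The class and method names are hypothetical, and the calculation is an expectation rather than a simulation of the sampler itself.

```java
// Hypothetical calculator for "reservoir plus fixed rate" sampling.
final class XRaySamplingSketch {

    /**
     * Expected traces per second: the reservoir is filled first,
     * then the fixed rate applies to whatever requests remain.
     */
    static double expectedTracedPerSecond(long requestsPerSecond, long reservoir, double fixedRate) {
        long viaReservoir = Math.min(requestsPerSecond, reservoir);
        long remainder = requestsPerSecond - viaReservoir;
        return viaReservoir + fixedRate * remainder;
    }

    public static void main(String[] args) {
        // Default rule from the text: 1 request per second, plus 5% of the rest.
        System.out.println(expectedTracedPerSecond(1, 1, 0.05));
        System.out.println(expectedTracedPerSecond(1000, 1, 0.05));
    }
}
```

At 1,000 requests per second this works out to roughly 51 traces per second, about 5 percent of traffic, while a service handling one request per second is traced in full.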
The real power appears when traces, metrics, and logs share context. Because each trace carries a traceId and segments can hold annotations, you can pivot in one click from a slow span in the trace map to the exact log lines that span produced. For example, annotating a segment with the customer tier or the downstream table name turns the X-Ray service map into a filterable diagnostic surface rather than a static picture. Therefore, a typical investigation flows from a composite alarm, to the Logs Insights query that quantifies the blast radius, to the X-Ray trace that pinpoints which service hop added the latency — all within the same console.
Two practical notes matter for production. First, propagate the trace header (X-Amzn-Trace-Id) across every hop, including outbound calls your own code makes, or the trace will break at the first un-instrumented boundary. Second, tune the sampling rules per route: sample a noisy health-check endpoint sparingly but trace a much higher fraction of checkout or payment requests, since those are the ones worth the storage. See the CloudWatch documentation for setup guides.
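Header propagation is normally handled by the X-Ray SDK or an OpenTelemetry instrumentation library, but it helps to see what is being passed along. The sketch below parses an inbound `X-Amzn-Trace-Id` value of the form `Root=...;Parent=...;Sampled=...` and builds the value for an outbound call, keeping the root trace ID and sampling decision while replacing the parent with the current segment's ID. The class, method names, and the segment ID in `main` are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper showing what "propagating the trace header" carries forward.
final class TraceHeaderSketch {

    static final String HEADER_NAME = "X-Amzn-Trace-Id";

    /** Splits "Root=...;Parent=...;Sampled=..." into its key/value fields. */
    static Map<String, String> parse(String headerValue) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String part : headerValue.split(";")) {
            int eq = part.indexOf('=');
            if (eq > 0) {
                fields.put(part.substring(0, eq).trim(), part.substring(eq + 1).trim());
            }
        }
        return fields;
    }

    /** Keeps the trace's Root and Sampled flag; the caller's segment becomes the Parent. */
    static String forOutboundCall(String inboundHeaderValue, String currentSegmentId) {
        Map<String, String> inbound = parse(inboundHeaderValue);
        String root = inbound.get("Root");
        if (root == null) {
            throw new IllegalArgumentException("trace header has no Root field");
        }
        StringBuilder out = new StringBuilder("Root=").append(root)
                .append(";Parent=").append(currentSegmentId);
        if (inbound.containsKey("Sampled")) {
            out.append(";Sampled=").append(inbound.get("Sampled"));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String inbound = "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1";
        // "aaaabbbbccccdddd" is a made-up segment ID for the current service.
        System.out.println(HEADER_NAME + ": " + forOutboundCall(inbound, "aaaabbbbccccdddd"));
    }
}
```

If an outbound HTTP client drops this header, the downstream service starts a new root trace, which is the broken trace described above.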
When NOT to Standardize on CloudWatch — Trade-offs
CloudWatch is the path of least resistance on AWS, but it is not a universal fit. In a genuinely multi-cloud or hybrid estate, a vendor-neutral stack built on OpenTelemetry plus a tool like Grafana or Datadog avoids splitting your observability across providers. Moreover, at very high log and custom-metric volumes the bill grows faster than many teams expect, and ingestion plus storage charges can rival the compute they monitor; setting log retention policies and trimming unused metrics is not optional at scale.
There are functional gaps too. Logs Insights is excellent for ad-hoc investigation but is not a long-term log warehouse, and its query latency over large time ranges can frustrate. Cross-account, cross-region dashboards require deliberate setup with observability access manager rather than working out of the box. Therefore, the pragmatic stance is to lean on CloudWatch for AWS-native workloads where the integrations save real engineering time, instrument with OpenTelemetry so you retain portability, and revisit the decision once your spend or your cloud footprint outgrows a single provider.
In conclusion, AWS CloudWatch provides a comprehensive platform for monitoring production systems. Use custom metrics with EMF for application KPIs, structured logging with Logs Insights for debugging, composite alarms for noise reduction, and X-Ray for distributed tracing. Build observability into your applications from day one, because it is far easier than retrofitting it after an incident, and watch cardinality and retention so the platform stays affordable as you grow.