Azure AKS: Production Kubernetes on Microsoft Cloud
Production Kubernetes deployments on Azure Kubernetes Service (AKS) build on Azure’s deep enterprise integration for running containerized workloads. AKS provides a managed Kubernetes control plane with Azure AD (now Microsoft Entra ID) authentication, Azure Monitor integration, and native VNet networking. Organizations already invested in the Microsoft ecosystem can therefore run Kubernetes with familiar security and identity models. Because Azure operates the control plane, your team is also freed from patching etcd and the API server and can focus on the workloads that actually deliver value.
AKS stands out from EKS and GKE in its Azure AD integration for RBAC, Azure Policy for governance, and Container Insights for monitoring. AKS also offers a free control plane on the Free tier, so you pay only for worker nodes. Consequently, AKS is often the most cost-effective managed Kubernetes option for Windows-heavy and .NET workloads. That said, the “free” control plane has no financially backed uptime SLA; production clusters should select the Standard tier, which adds a financially backed control-plane SLA (99.95% when the cluster uses availability zones, 99.9% otherwise) for a modest hourly fee.
Azure AKS Kubernetes Production: Cluster Setup
Create a production AKS cluster with multiple node pools, Azure CNI networking, and managed identity. System node pools host critical cluster add-ons such as CoreDNS and the metrics server (the control plane itself runs on Azure’s side), while user node pools run application workloads. Enable the cluster autoscaler and Azure AD RBAC from the start. Separating system and user pools matters more than it appears: pinning kube-system pods to a dedicated system pool, tainted for example with CriticalAddonsOnly=true:NoSchedule, prevents a noisy application from starving CoreDNS or the metrics server, which is a common cause of mysterious cluster-wide latency.
# Create production AKS cluster
az aks create \
  --resource-group prod-rg \
  --name prod-cluster \
  --kubernetes-version 1.29 \
  --node-count 3 \
  --node-vm-size Standard_D4s_v5 \
  --nodepool-name system \
  --network-plugin azure \
  --network-policy calico \
  --vnet-subnet-id /subscriptions/.../subnets/aks-subnet \
  --enable-managed-identity \
  --enable-aad \
  --aad-admin-group-object-ids "xxx-xxx" \
  --enable-azure-rbac \
  --enable-addons monitoring \
  --workspace-resource-id /subscriptions/.../workspaces/logs \
  --zones 1 2 3 \
  --tier standard
# Add application node pool
az aks nodepool add \
  --resource-group prod-rg \
  --cluster-name prod-cluster \
  --name apps \
  --node-count 3 \
  --min-count 2 \
  --max-count 20 \
  --enable-cluster-autoscaler \
  --node-vm-size Standard_D8s_v5 \
  --zones 1 2 3 \
  --labels workload=application \
  --node-taints dedicated=apps:NoSchedule

One decision deserves early attention: the network plugin. Azure CNI assigns every pod a real VNet IP, which gives you native connectivity and fine-grained network policy but consumes address space quickly, because each node reserves a block of IPs whether pods use them or not. Size your subnet generously up front: a subnet that is too small caps how far the cluster can scale and is difficult to resize once nodes are running in it. For clusters that will grow large, Azure CNI Overlay mode decouples pod IPs from the VNet and largely sidesteps the exhaustion problem.
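To make the subnet sizing concrete, here is a rough calculation for traditional (non-Overlay) Azure CNI. It is a sketch under stated assumptions, not an official sizing tool: it assumes each node takes one IP for itself plus one per possible pod, that one extra node is surged during upgrades, that Azure reserves five addresses per subnet, and that the pod limit is 30 per node. The function names are illustrative.

```shell
#!/bin/sh
# Rough subnet sizing for traditional Azure CNI (not Overlay).
# Assumptions: 1 IP per node + max_pods IPs per node, 1 surge node
# during upgrades, and 5 addresses reserved by Azure in every subnet.

required_ips() {
  nodes=$1
  max_pods=$2
  echo $(( (nodes + 1) * (max_pods + 1) + 5 ))
}

# Smallest prefix length whose subnet holds at least $1 addresses.
smallest_prefix() {
  need=$1
  prefix=32
  size=1
  while [ "$size" -lt "$need" ]; do
    size=$(( size * 2 ))
    prefix=$(( prefix - 1 ))
  done
  echo "$prefix"
}

# 3 system nodes + up to 20 application nodes, 30 pods per node (assumed).
ips=$(required_ips 23 30)
echo "IPs needed: $ips, smallest subnet: /$(smallest_prefix "$ips")"
```

With these inputs the cluster needs 749 addresses, so a /22 is the minimum and anything smaller caps the autoscaler before it reaches 20 application nodes. Raising the per-node pod limit or the surge setting changes the result, so rerun the numbers with your own values.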
Azure AD Integration and RBAC
AKS integrates with Azure AD for authentication, mapping Kubernetes RBAC to Azure AD groups. Developers authenticate with their corporate credentials, and access is managed through familiar Azure AD group memberships. Additionally, Azure RBAC extends Kubernetes authorization with Azure-native roles. The practical payoff is that offboarding becomes simple: removing someone from an Azure AD (Microsoft Entra ID) group revokes their cluster access as soon as their current token expires, with no orphaned kubeconfig credentials lingering on laptops.
# Kubernetes RBAC mapped to Azure AD group
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dev-team-access
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: "aad-group-id-for-developers" # Azure AD group object ID, not the display name

For application identity rather than human identity, prefer Workload Identity Federation over storing service-principal secrets. With it, a pod’s Kubernetes service account is federated to a managed identity, and Azure exchanges the projected token for an access token at runtime. As a result, your pods reach Key Vault or Storage with no long-lived credentials in the cluster at all, closing one of the most common secret-leak vectors.
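The Kubernetes side of that federation can be sketched as follows. This is a minimal illustration, not a complete setup: the namespace, service account name, and client ID are hypothetical placeholders, and the script only renders the manifest and prints the subject string that the federated credential on the managed identity has to match.

```shell
#!/bin/sh
# Hypothetical values; replace with your own.
NAMESPACE="orders"
SERVICE_ACCOUNT="orders-api-sa"
CLIENT_ID="00000000-0000-0000-0000-000000000000" # client ID of the managed identity

# Subject that the federated credential on the managed identity must match.
federated_subject() {
  echo "system:serviceaccount:$1:$2"
}

# Render the ServiceAccount that pods will run under.
render_service_account() {
  cat <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: $SERVICE_ACCOUNT
  namespace: $NAMESPACE
  annotations:
    azure.workload.identity/client-id: "$CLIENT_ID"
EOF
}

render_service_account
echo "# federated credential subject: $(federated_subject "$NAMESPACE" "$SERVICE_ACCOUNT")"
```

The rendered manifest can be piped into kubectl apply -f -. For the token exchange to work, the cluster also needs the OIDC issuer and workload identity features enabled (az aks update with --enable-oidc-issuer and --enable-workload-identity), a federated credential created on the managed identity with the subject printed above, and the label azure.workload.identity/use: "true" on the pod template.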
Production-Ready Workloads: Health, Budgets, and Resources
A cluster is only as reliable as the workloads on it, and AKS will faithfully run a fragile deployment into the ground. Three settings separate a resilient service from a flaky one: accurate resource requests, liveness and readiness probes, and a pod disruption budget. Requests let the scheduler place pods sensibly and drive autoscaling decisions; probes let Kubernetes restart hung pods and stop routing traffic to ones that are not ready; and a disruption budget protects you during voluntary disruptions such as node upgrades, which are routine on AKS.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
spec:
  replicas: 3
  # apps/v1 requires a selector that matches the pod template labels;
  # the PodDisruptionBudget below selects the same label.
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      nodeSelector:
        workload: application
      tolerations:
        - key: dedicated
          operator: Equal
          value: apps
          effect: NoSchedule
      containers:
        - name: orders-api
          image: myregistry.azurecr.io/orders-api:1.4.2
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              memory: "512Mi" # omit a CPU limit to avoid throttling
          readinessProbe:
            httpGet: { path: /healthz/ready, port: 8080 }
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet: { path: /healthz/live, port: 8080 }
            initialDelaySeconds: 15
            periodSeconds: 20
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders-api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: orders-api

Note one deliberate choice above: the deployment sets a memory limit equal to its request but omits a CPU limit. Setting a hard CPU limit causes the kernel’s CFS scheduler to throttle the container even when the node has spare capacity, which manifests as inexplicable tail latency. Conversely, memory has no graceful throttling: exceeding the limit triggers an OOM kill, so a memory limit is genuinely protective. This asymmetry is widely recommended but easy to get backwards.
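If you suspect a CPU limit is already hurting a service, the kernel's own counters can confirm it. The sketch below computes the share of CFS periods in which a container was throttled, using the nr_periods and nr_throttled fields of the cgroup cpu.stat file. The default path is an assumption that holds inside a container on a cgroup v2 node; cgroup v1 nodes keep the file elsewhere, so pass the path as the first argument there.

```shell
#!/bin/sh
# Percentage of CFS periods in which the cgroup was throttled,
# read from a cpu.stat file passed as the first argument.
throttled_pct() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%.1f\n", 100 * throttled / periods
      else print "0.0"
    }
  ' "$1"
}

# Assumed default location inside a container on a cgroup v2 node.
STAT_FILE="${1:-/sys/fs/cgroup/cpu.stat}"
if [ -r "$STAT_FILE" ]; then
  echo "throttled periods: $(throttled_pct "$STAT_FILE")%"
fi
```

A container with no CPU limit should report a value at or near zero. A persistently high percentage on a latency-sensitive service is a strong hint that the limit, not the node, is the bottleneck.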
Container Insights and Monitoring
Container Insights provides comprehensive monitoring for AKS — node health, pod metrics, container logs, and Prometheus metrics collection. Furthermore, it integrates with Azure Monitor workbooks for pre-built dashboards and custom alerting rules. Be deliberate about log ingestion, however, because Container Insights bills on the volume of data sent to the Log Analytics workspace, and a chatty debug logger across hundreds of pods can generate a surprising monthly bill. Therefore, tune data collection rules to sample or exclude verbose namespaces, and lean on the managed Prometheus and Azure Managed Grafana offerings for high-cardinality metrics that you do not need to retain as raw logs.
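A quick back-of-the-envelope estimate shows how fast log volume adds up. Every input below is hypothetical (pod count, log rate, and line size are placeholders, not measurements), and the function name is illustrative; the point is the order of magnitude, not the exact figure.

```shell
#!/bin/sh
# Approximate log volume per 30-day month, in decimal GB.
# All inputs are hypothetical; measure your own pods before relying on this.
monthly_log_gb() {
  pods=$1
  lines_per_sec=$2
  bytes_per_line=$3
  seconds_per_month=$(( 60 * 60 * 24 * 30 ))
  echo $(( pods * lines_per_sec * bytes_per_line * seconds_per_month / 1000000000 ))
}

# 200 pods, each writing 50 debug lines per second at roughly 200 bytes per line.
echo "approx. $(monthly_log_gb 200 50 200) GB ingested per month"
```

Under these assumptions the cluster ships roughly 5 TB of logs a month, which is why excluding or sampling verbose namespaces is usually the first cost lever to pull. Multiply the result by your workspace's per-GB ingestion price to get a bill estimate.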
Cost Optimization
Use spot node pools for fault-tolerant workloads (up to 80% savings versus on-demand, per Azure’s spot pricing), cluster autoscaler for right-sizing, and Azure Reserved Instances for baseline capacity. Additionally, the AKS cost analysis view in Azure Portal shows per-namespace cost breakdowns. Spot nodes carry a real caveat: Azure can evict them with only about 30 seconds of notice when it reclaims capacity, so reserve them strictly for batch jobs, CI runners, and stateless services that tolerate sudden node loss—never for stateful databases or anything without graceful-shutdown handling. See the AKS documentation for production best practices.
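A spot pool for interruptible work might be added along these lines. Treat it as a sketch: the pool name, size limits, and label are hypothetical, and the small run wrapper only prints the command unless APPLY=1 is set, so the script is safe to execute without touching Azure.

```shell
#!/bin/sh
# Prints the command by default; run with APPLY=1 to execute it against Azure.
run() {
  if [ "${APPLY:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}

# Hypothetical spot pool for batch work. A max price of -1 means
# "pay up to the current on-demand price" rather than a fixed cap.
run az aks nodepool add \
  --resource-group prod-rg \
  --cluster-name prod-cluster \
  --name spot \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price -1 \
  --enable-cluster-autoscaler \
  --min-count 0 \
  --max-count 10 \
  --node-vm-size Standard_D8s_v5 \
  --labels workload=batch
```

AKS taints spot nodes with kubernetes.azure.com/scalesetpriority=spot:NoSchedule, so only workloads that carry a matching toleration are scheduled there. That default is a useful safety net: a stateful service cannot land on a spot node by accident.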
When AKS Is Not the Right Fit
For all its strengths, AKS is not a universal answer. If your organization runs primarily on AWS or Google Cloud, the gravitational pull of co-located data and existing IAM usually outweighs AKS’s integration advantages; cross-cloud egress fees and split identity models erode the benefit. Likewise, for a single small service, the operational surface of any Kubernetes cluster—upgrades, networking, RBAC, observability—is hard to justify when Azure Container Apps or App Service would run the same container with a fraction of the maintenance. Honestly, Kubernetes earns its keep at the scale where you are orchestrating many services, not where you are hosting one.
Even within Azure, weigh the upgrade cadence. AKS supports a given Kubernetes minor version for roughly a year, so a production cluster commits you to a recurring upgrade treadmill that you must plan and test for. Teams that treat upgrades as an afterthought eventually find themselves forced onto a new version on Azure’s timeline rather than their own. Therefore, budget for that maintenance as an ongoing cost, not a one-time setup task.
Key Takeaways
- Choose the Standard tier for production so the control plane carries a financially backed SLA
- Separate system and user node pools, and size the Azure CNI subnet for the cluster’s maximum scale
- Use Azure AD groups for human access and workload identity federation for pods
- Give every workload accurate resource requests, readiness and liveness probes, and a pod disruption budget
- Watch Log Analytics ingestion costs, keep spot nodes for interruptible work, and budget for recurring Kubernetes upgrades
In conclusion, production AKS deployments benefit from deep Azure ecosystem integration: Azure AD for identity, Container Insights for monitoring, and Azure Policy for governance. If your organization already uses Microsoft tools, AKS provides the most natural Kubernetes experience with enterprise security baked in. Nevertheless, treat node pool design, workload health settings, and the upgrade lifecycle as first-class concerns, because the managed control plane covers only part of what production reliability demands.