AWS VPC Networking for Production Systems
AWS VPC networking architecture is the foundation of every cloud deployment. A well-designed VPC provides network isolation, controls traffic flow, and enables secure communication between services. Investing time in VPC design up front prevents costly re-architecture later. The network layer is one of the hardest things to change once workloads are live, because CIDR ranges, route tables, and security boundaries become load-bearing dependencies for everything above them.
Most production environments need multiple VPCs: separate networks for production, staging, and development, often across multiple AWS accounts. Services also need to communicate securely across VPCs and with on-premises networks, so cloud architects need a working understanding of subnets, route tables, NAT gateways, Transit Gateway, and PrivateLink. The sections below walk through each building block with concrete configuration, the edge cases that bite teams in production, and an honest look at where this design pattern stops being worth the complexity.
AWS VPC Networking Architecture: Multi-AZ Subnet Design
A production VPC should span at least 3 Availability Zones with public, private, and isolated subnet tiers. Public subnets host load balancers, private subnets host application workloads, and isolated subnets host databases with no internet access. Use a CIDR block large enough for growth: a /16 provides 65,536 IP addresses and is the largest block a VPC allows.
# CloudFormation: Production VPC with 3 AZs, 3 tiers
Resources:
  VPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16
      EnableDnsHostnames: true
      EnableDnsSupport: true
      Tags:
        - Key: Name
          Value: production-vpc
  # Public Subnets (ALB, NAT Gateway)
  PublicSubnetA:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      CidrBlock: 10.0.1.0/24 # 251 usable IPs (AWS reserves 5 per subnet)
      AvailabilityZone: !Select [0, !GetAZs '']
      MapPublicIpOnLaunch: true
  PublicSubnetB:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      CidrBlock: 10.0.2.0/24
      AvailabilityZone: !Select [1, !GetAZs '']
  PublicSubnetC:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      CidrBlock: 10.0.3.0/24
      AvailabilityZone: !Select [2, !GetAZs '']
  # Private Subnets (ECS, EKS, Lambda)
  PrivateSubnetA:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      CidrBlock: 10.0.10.0/24
      AvailabilityZone: !Select [0, !GetAZs '']
  # Isolated Subnets (RDS, ElastiCache — no internet)
  IsolatedSubnetA:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      CidrBlock: 10.0.20.0/24
      AvailabilityZone: !Select [0, !GetAZs '']
  # Elastic IP for the NAT gateway
  NatEIPA:
    Type: AWS::EC2::EIP
    Properties:
      Domain: vpc
  # NAT Gateway for private subnet internet access
  NatGatewayA:
    Type: AWS::EC2::NatGateway
    Properties:
      AllocationId: !GetAtt NatEIPA.AllocationId
      SubnetId: !Ref PublicSubnetA
CIDR Planning and IP Exhaustion
A common mistake in VPC design is sizing CIDR blocks without planning for growth. A /24 subnet looks generous at 256 addresses, but AWS reserves 5 of them per subnet (network address, VPC router, DNS, future use, and broadcast), leaving 251 usable. That matters enormously for EKS, where the AWS VPC CNI assigns a routable VPC IP to every pod. A cluster of 50 nodes running 30 pods each needs 1,500 pod addresses plus 50 for the nodes themselves, which already exceeds the 1,019 usable addresses a /22 offers at best.
The fix is deliberate hierarchy. Reserve a large supernet such as 10.0.0.0/16 per VPC, carve /20 blocks per tier, and only then split into /24 per-AZ subnets. Never reuse the same CIDR across two VPCs you might later connect: overlapping ranges make peering and Transit Gateway routing impossible without messy NAT. For pod-dense Kubernetes clusters, a common pattern is to add a secondary CIDR (from the 100.64.0.0/10 carrier-grade NAT range) to the VPC and allocate pod IPs from it, keeping the primary range free for nodes and load balancers.
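In CloudFormation terms, the secondary range could be attached roughly as follows. This is a minimal sketch that assumes the `VPC` resource from the template above; the logical names `PodCidr` and `PodSubnetA` and the /19 subnet size are illustrative assumptions, not values from this article. Getting the VPC CNI to draw pod IPs from this subnet also requires the CNI's custom networking mode, which is not shown here.

```yaml
# Sketch only: logical names and the /19 size are assumptions.
Resources:
  PodCidr:
    Type: AWS::EC2::VPCCidrBlock
    Properties:
      VpcId: !Ref VPC
      CidrBlock: 100.64.0.0/16
  PodSubnetA:
    Type: AWS::EC2::Subnet
    DependsOn: PodCidr            # create the subnet only after the CIDR is associated
    Properties:
      VpcId: !Ref VPC
      CidrBlock: 100.64.0.0/19    # 8,192 addresses, 8,187 usable after the 5 reserved
      AvailabilityZone: !Select [0, !GetAZs '']
```

The same five-address reservation applies to subnets in the secondary range, so the sizing arithmetic above carries over unchanged.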
# Audit free addresses before you run out
aws ec2 describe-subnets \
  --filters "Name=vpc-id,Values=vpc-0abc123" \
  --query 'Subnets[].{Subnet:SubnetId,CIDR:CidrBlock,Free:AvailableIpAddressCount}' \
  --output table
# Add a secondary CIDR for pod-dense EKS clusters
aws ec2 associate-vpc-cidr-block \
  --vpc-id vpc-0abc123 \
  --cidr-block 100.64.0.0/16
Transit Gateway: Multi-VPC Connectivity
Transit Gateway acts as a central hub connecting multiple VPCs, VPN connections, and Direct Connect gateways. Instead of managing dozens of VPC peering connections, you get a hub-and-spoke model in which Transit Gateway route tables control which VPCs can communicate with each other. This matters because raw VPC peering does not scale: peering is non-transitive, so fully connecting N VPCs requires N(N-1)/2 connections (45 for 10 VPCs), whereas a hub needs only N attachments.
# Transit Gateway connecting prod, staging, shared-services VPCs
TransitGateway:
  Type: AWS::EC2::TransitGateway
  Properties:
    AutoAcceptSharedAttachments: enable
    DefaultRouteTableAssociation: disable
    DefaultRouteTablePropagation: disable
    DnsSupport: enable
    Tags:
      - Key: Name
        Value: central-tgw
# Route table: Prod can reach shared-services but NOT staging
ProdRouteTable:
  Type: AWS::EC2::TransitGatewayRouteTable
  Properties:
    TransitGatewayId: !Ref TransitGateway
ProdToSharedRoute:
  Type: AWS::EC2::TransitGatewayRoute
  Properties:
    TransitGatewayRouteTableId: !Ref ProdRouteTable
    DestinationCidrBlock: 10.1.0.0/16 # shared-services VPC
    TransitGatewayAttachmentId: !Ref SharedServicesAttachment
Notice that the template above disables default route table association and propagation. That choice is deliberate: with defaults enabled, every attached VPC can reach every other one, which quietly defeats the segmentation you are trying to enforce. By using separate Transit Gateway route tables per environment, you create explicit, auditable paths: production can reach shared services, staging stays isolated, and a compromised dev VPC cannot pivot into prod. For a deeper look at how this fits alongside identity boundaries, see our guide on production VPC networking architecture.
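A Transit Gateway route table only affects attachments that are associated with it, and the VPC itself still needs a route pointing at the gateway. The fragment below sketches that wiring for the production VPC. It assumes the `TransitGateway` and `ProdRouteTable` resources from the template above; `ProdVpc`, `ProdTgwSubnetA`, `ProdPrivateRouteTable`, and the three new logical names are hypothetical, introduced only for illustration.

```yaml
# Sketch only: ProdVpc, ProdTgwSubnetA and ProdPrivateRouteTable are assumed to exist elsewhere.
ProdAttachment:
  Type: AWS::EC2::TransitGatewayAttachment
  Properties:
    TransitGatewayId: !Ref TransitGateway
    VpcId: !Ref ProdVpc
    SubnetIds:
      - !Ref ProdTgwSubnetA       # one subnet per AZ the attachment should serve
# Bind the prod attachment to the prod route table (no default association is in effect)
ProdAssociation:
  Type: AWS::EC2::TransitGatewayRouteTableAssociation
  Properties:
    TransitGatewayAttachmentId: !Ref ProdAttachment
    TransitGatewayRouteTableId: !Ref ProdRouteTable
# VPC-side route: send shared-services traffic to the Transit Gateway
ProdVpcRouteToShared:
  Type: AWS::EC2::Route
  DependsOn: ProdAttachment
  Properties:
    RouteTableId: !Ref ProdPrivateRouteTable
    DestinationCidrBlock: 10.1.0.0/16
    TransitGatewayId: !Ref TransitGateway
```

Return traffic needs the mirror image: the shared-services attachment would be associated with its own route table carrying a route back to the production CIDR.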
NAT Gateway Cost and High Availability Trade-offs
NAT gateways are where networking quietly becomes expensive. Each gateway carries an hourly charge plus a per-GB data processing fee, and that processing fee applies even to traffic destined for AWS services in the same region. A frequent and painful surprise is pulling multi-gigabyte container images from a public registry or shipping logs to a SaaS endpoint, all routed through NAT at full per-GB cost. In production, teams often find that NAT data processing is one of the larger line items on the networking bill.
There is also a real availability decision hiding here. A NAT gateway lives in a single Availability Zone; if that AZ fails, private subnets routed through it lose internet access. The resilient pattern is one NAT gateway per AZ with per-AZ route tables, so each zone is self-contained. In a three-AZ VPC, that triples the fixed hourly cost, so many teams run a single shared NAT in non-production environments to save money and accept the lower resilience, while reserving the per-AZ design for production. The most effective cost lever, though, is avoiding NAT entirely for AWS-bound traffic by using VPC endpoints, covered next.
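The per-AZ pattern comes down to one route table per zone whose default route points at that zone's NAT gateway. A minimal sketch for zone A, assuming the `VPC`, `PrivateSubnetA`, and `NatGatewayA` resources from the first template; the three logical names below are hypothetical.

```yaml
# Sketch only: repeat with B and C suffixes (and NatGatewayB/C) for the other zones.
PrivateRouteTableA:
  Type: AWS::EC2::RouteTable
  Properties:
    VpcId: !Ref VPC
PrivateDefaultRouteA:
  Type: AWS::EC2::Route
  Properties:
    RouteTableId: !Ref PrivateRouteTableA
    DestinationCidrBlock: 0.0.0.0/0
    NatGatewayId: !Ref NatGatewayA    # same-AZ gateway keeps the zone self-contained
PrivateSubnetAAssociation:
  Type: AWS::EC2::SubnetRouteTableAssociation
  Properties:
    SubnetId: !Ref PrivateSubnetA
    RouteTableId: !Ref PrivateRouteTableA
```

For the cheaper single-NAT variant, the route tables in the other zones would simply reference `NatGatewayA` as well.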
VPC Endpoints and PrivateLink
VPC endpoints enable private connectivity to AWS services without traversing the internet. Gateway endpoints (S3, DynamoDB) are free, while interface endpoints (PrivateLink) cost per hour and per GB. PrivateLink also enables private connectivity to third-party services and to your own services across VPCs. Because a gateway endpoint for S3 routes traffic directly through the VPC route table rather than the NAT gateway, it both removes the per-GB NAT charge and keeps the data path inside the AWS backbone, which is a security and cost win at the same time.
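As a rough illustration, the two endpoint types could be declared like this. The sketch assumes the `VPC` and `PrivateSubnetA` resources from the first template; `PrivateRouteTableA` and `EndpointSecurityGroup` are hypothetical resources assumed to be defined elsewhere, and CloudWatch Logs is just an example service.

```yaml
# Gateway endpoint: adds S3 prefix-list routes to the listed route tables, no hourly fee
S3GatewayEndpoint:
  Type: AWS::EC2::VPCEndpoint
  Properties:
    VpcId: !Ref VPC
    VpcEndpointType: Gateway
    ServiceName: !Sub com.amazonaws.${AWS::Region}.s3
    RouteTableIds:
      - !Ref PrivateRouteTableA
# Interface endpoint (PrivateLink): places a network interface in the subnet, billed per hour and per GB
LogsInterfaceEndpoint:
  Type: AWS::EC2::VPCEndpoint
  Properties:
    VpcId: !Ref VPC
    VpcEndpointType: Interface
    ServiceName: !Sub com.amazonaws.${AWS::Region}.logs
    PrivateDnsEnabled: true           # the regional service hostname resolves to the endpoint
    SubnetIds:
      - !Ref PrivateSubnetA
    SecurityGroupIds:
      - !GetAtt EndpointSecurityGroup.GroupId
```

The endpoint's security group would need to allow HTTPS (port 443) from the workloads that call the service.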
Security Groups and NACLs
Security groups are stateful firewalls attached to instances (strictly, to their network interfaces): they support allow rules only, and return traffic is permitted automatically. Network ACLs are stateless firewalls at the subnet level and require explicit allow rules for both inbound and outbound traffic. Use security groups as your primary control and NACLs as an additional defense layer. A powerful and underused feature is referencing one security group from another instead of hardcoding CIDR ranges. For example, the database security group can accept traffic only from the application security group, regardless of which IPs the app instances currently hold. See the AWS VPC documentation for complete networking reference.
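A security-group reference of that kind might look like the following. This is a sketch assuming the `VPC` resource from the first template; the group names and the PostgreSQL port 5432 are illustrative assumptions.

```yaml
AppSecurityGroup:
  Type: AWS::EC2::SecurityGroup
  Properties:
    GroupDescription: Application tier
    VpcId: !Ref VPC
DbSecurityGroup:
  Type: AWS::EC2::SecurityGroup
  Properties:
    GroupDescription: Database tier
    VpcId: !Ref VPC
    SecurityGroupIngress:
      - IpProtocol: tcp
        FromPort: 5432                # assumed database port
        ToPort: 5432
        # Source is a group, not a CIDR: any interface in AppSecurityGroup is allowed
        SourceSecurityGroupId: !GetAtt AppSecurityGroup.GroupId
```

Because the rule follows group membership, scaling the application tier or replacing instances requires no change to the database rule.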
NACLs deserve respect precisely because they are stateless. Since they do not track connections, you must explicitly allow the ephemeral port range (typically 1024–65535) in the direction the replies travel: outbound for responses to inbound requests, and inbound for responses to connections your instances initiate. Otherwise legitimate replies get silently dropped. This is a classic late-night debugging session: the security group looks correct, the application is healthy, yet connections hang because a custom NACL forgot the ephemeral range. As a rule, keep NACLs simple and coarse, with broad subnet-level deny rules for known-bad ranges, and let security groups do the fine-grained work.
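The ephemeral-range requirement can be sketched as a pair of entries on a custom NACL for a subnet that serves HTTPS. The fragment assumes the `VPC` resource from the first template; the logical names and rule numbers are hypothetical, and the NACL would still need a subnet association to take effect.

```yaml
PublicNacl:
  Type: AWS::EC2::NetworkAcl
  Properties:
    VpcId: !Ref VPC
# Inbound: clients connect to port 443
InboundHttps:
  Type: AWS::EC2::NetworkAclEntry
  Properties:
    NetworkAclId: !Ref PublicNacl
    RuleNumber: 100
    Protocol: 6                       # TCP
    RuleAction: allow
    Egress: false
    CidrBlock: 0.0.0.0/0
    PortRange:
      From: 443
      To: 443
# Outbound: replies go back to the client's ephemeral port, so this rule is required
OutboundEphemeral:
  Type: AWS::EC2::NetworkAclEntry
  Properties:
    NetworkAclId: !Ref PublicNacl
    RuleNumber: 100
    Protocol: 6
    RuleAction: allow
    Egress: true
    CidrBlock: 0.0.0.0/0
    PortRange:
      From: 1024
      To: 65535
```

Dropping the `OutboundEphemeral` entry reproduces the hanging-connection symptom: the request is admitted, but the response never leaves the subnet.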
When NOT to Reach for This Architecture
This three-tier, multi-AZ, Transit Gateway design is the right default for medium-to-large production systems, but it is genuine overhead and not always justified. For a small single-service application or an early-stage project, the default VPC plus a couple of subnets is perfectly adequate, and building a Transit Gateway hub on day one is premature optimization. Likewise, if your entire workload runs on serverless primitives such as Lambda behind API Gateway and DynamoDB, you may not need custom subnets or NAT at all. Attaching those functions to a VPC adds NAT or endpoint cost and configuration to maintain without a corresponding benefit, and it has historically lengthened cold starts as well.
The honest trade-off is operational complexity. Each VPC, Transit Gateway attachment, route table, and endpoint is another object to monitor, secure, and reason about during an incident. Conversely, fully managed networking abstractions hide that complexity but reduce your control. The decision rule is straightforward: adopt the full architecture when you have multiple environments, compliance-driven isolation requirements, or hybrid on-premises connectivity. Otherwise, start lean and let real requirements pull you toward more structure. For the broader cloud foundation context, our cloud networking deep dive expands on these decision points.
Key Takeaways
- Plan CIDR hierarchy upfront — size for EKS pod density and never overlap ranges you might later connect
- Use Transit Gateway route tables to enforce segmentation; disable default association and propagation
- Control NAT gateway cost with VPC endpoints and balance per-AZ resilience against the hourly fee
- Prefer security-group-to-security-group references over hardcoded CIDRs, and remember NACLs are stateless
- Match architecture complexity to real requirements; do not build a hub-and-spoke network before you need one
In conclusion, AWS VPC networking architecture requires thoughtful design upfront to avoid costly re-architecture. Use multi-AZ, three-tier subnets for isolation, Transit Gateway for multi-VPC connectivity, and VPC endpoints for private AWS service access. Plan your CIDR ranges carefully, watch the NAT bill, enforce least-privilege with referenced security groups, and document your network topology — it’s the foundation everything else builds on.