
Zero-Trust Networking for AI Agents: Protecting Your Internal VPC

Preventing SSRF and lateral movement by dynamically provisioning Kubernetes Network Policies for untrusted workloads.

By Team Astraq

In Hardening the Control Plane: From Pods to MicroVMs, we solved the compute isolation problem by wrapping untrusted agents in Firecracker MicroVMs.

But computation is only half the battle. AI agents are useful because they connect to things. They need to crawl websites, call external APIs, and fetch data.

However, giving an untrusted agent unrestricted network access is a massive security risk. Server-Side Request Forgery (SSRF) is the #1 threat here. If an agent can curl your cloud provider's Metadata Service (169.254.169.254) or ping your internal database, your isolation is broken.
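
To see why this matters, here is the kind of probe a compromised agent will try within seconds of landing in your cluster. A minimal illustrative sketch (the URL is the standard AWS IMDSv1 metadata route); this is the request our network policy must make impossible, not code we ship:

go

package main

import (
    "fmt"
    "io"
    "net/http"
    "time"
)

func main() {
    client := &http.Client{Timeout: 2 * time.Second}

    // Classic SSRF target: the cloud metadata service, which can leak
    // IAM credentials for the node the agent is running on.
    resp, err := client.Get("http://169.254.169.254/latest/meta-data/")
    if err != nil {
        fmt.Println("blocked (good):", err)
        return
    }
    defer resp.Body.Close()

    body, _ := io.ReadAll(resp.Body)
    fmt.Println("reachable (bad):", string(body))
}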

In this post, we'll upgrade our Go Control Plane to automatically provision Zero-Trust Network Policies.

The Goal: "Default Deny"

We want a networking model where:

  1. Ingress is Blocked: No one can call the agent (except maybe Prometheus).
  2. Internal Egress is Blocked: The agent cannot talk to other Pods, Services, or the Node IP.
  3. Public Egress is Allowed: The agent can reach the public internet (Google, OpenAI, etc.).
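
Once the policy from the next section is applied, you can smoke-test all three properties from inside a running agent pod. A minimal sketch; the internal address is a placeholder for a real ClusterIP in your cluster:

go

package main

import (
    "fmt"
    "net"
    "time"
)

// checkDial reports whether a TCP connection to addr succeeds within 2s.
func checkDial(label, addr string) {
    conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
    if err != nil {
        fmt.Printf("%-10s %s -> BLOCKED (%v)\n", label, addr, err)
        return
    }
    conn.Close()
    fmt.Printf("%-10s %s -> ALLOWED\n", label, addr)
}

func main() {
    checkDial("internal", "10.0.12.34:5432")    // placeholder ClusterIP: expect BLOCKED
    checkDial("metadata", "169.254.169.254:80") // cloud metadata: expect BLOCKED
    checkDial("public", "1.1.1.1:443")          // well-known public IP: expect ALLOWED

    // Goal #3 also implies DNS must resolve (see RULE 1 below).
    if _, err := net.LookupHost("example.com"); err != nil {
        fmt.Println("DNS -> BLOCKED:", err)
    } else {
        fmt.Println("DNS -> ALLOWED")
    }
}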

Default Deny Policy

We don't want to rely on manual configuration. We'll update our Go FirecrackerRuntime to create a NetworkPolicy alongside every Pod.

go

package kubernetes

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    networkingv1 "k8s.io/api/networking/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/intstr"
)

func (f *FirecrackerRuntime) createNetworkPolicy(ctx context.Context, id string) error {
    udp := corev1.ProtocolUDP
    tcp := corev1.ProtocolTCP
    dnsPort := intstr.FromInt(53)

    policy := &networkingv1.NetworkPolicy{
        ObjectMeta: metav1.ObjectMeta{
            Name: "isolate-" + id,
            Labels: map[string]string{
                "managed-by": "orchestrator",
            },
        },
        Spec: networkingv1.NetworkPolicySpec{
            PodSelector: metav1.LabelSelector{
                MatchLabels: map[string]string{
                    "job-id": id, // Target the specific agent
                },
            },
            // Listing Ingress with no Ingress rules blocks all inbound
            // traffic (Goal #1). Egress is constrained by the rules below.
            PolicyTypes: []networkingv1.PolicyType{
                networkingv1.PolicyTypeIngress,
                networkingv1.PolicyTypeEgress,
            },
            Egress: []networkingv1.NetworkPolicyEgressRule{
                {
                    // RULE 1: Allow DNS (essential! Without it the agent
                    // cannot resolve any public hostname).
                    To: []networkingv1.NetworkPolicyPeer{
                        {
                            NamespaceSelector: &metav1.LabelSelector{},
                            PodSelector: &metav1.LabelSelector{
                                MatchLabels: map[string]string{"k8s-app": "kube-dns"},
                            },
                        },
                    },
                    Ports: []networkingv1.NetworkPolicyPort{
                        {Port: &dnsPort, Protocol: &udp},
                        {Port: &dnsPort, Protocol: &tcp},
                    },
                },
                {
                    // RULE 2: Allow public egress, block private ranges.
                    // Standard K8s NetworkPolicies are allow-lists: there is
                    // no "Deny" rule. To implement "Allow Public, Block
                    // Private", we allow 0.0.0.0/0 and carve out the private
                    // ranges with IPBlock "Except" clauses.
                    To: []networkingv1.NetworkPolicyPeer{
                        {
                            IPBlock: &networkingv1.IPBlock{
                                CIDR: "0.0.0.0/0",
                                Except: []string{
                                    "10.0.0.0/8",         // RFC 1918 private (Class A)
                                    "172.16.0.0/12",      // RFC 1918 private (Class B)
                                    "192.168.0.0/16",     // RFC 1918 private (Class C)
                                    "169.254.169.254/32", // AWS/Cloud Metadata
                                },
                            },
                        },
                    },
                },
            },
        },
    }

    _, err := f.clientset.NetworkingV1().NetworkPolicies(f.namespace).Create(
        ctx,
        policy,
        metav1.CreateOptions{},
    )
    return err
}
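
To keep this behavior from regressing, we can pin it down with a unit test against client-go's fake clientset. A sketch, assuming the runtime's clientset field is typed as kubernetes.Interface (which the fake satisfies) and the test lives in the same package:

go

package kubernetes

import (
    "context"
    "testing"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes/fake"
)

func TestCreateNetworkPolicy(t *testing.T) {
    // Assumes clientset and namespace are the only fields this path touches.
    f := &FirecrackerRuntime{
        clientset: fake.NewSimpleClientset(),
        namespace: "agents",
    }

    if err := f.createNetworkPolicy(context.Background(), "job-123"); err != nil {
        t.Fatalf("createNetworkPolicy: %v", err)
    }

    np, err := f.clientset.NetworkingV1().NetworkPolicies("agents").
        Get(context.Background(), "isolate-job-123", metav1.GetOptions{})
    if err != nil {
        t.Fatalf("policy not created: %v", err)
    }
    if np.Spec.PodSelector.MatchLabels["job-id"] != "job-123" {
        t.Errorf("policy selects wrong pod: %v", np.Spec.PodSelector.MatchLabels)
    }
}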

Integrating with Provision()

We simply add this call to our existing Provision method.

go

func (f *FirecrackerRuntime) Provision(ctx context.Context, id string, spec runtime.Spec) error {
    // 1. Create the Pod (as seen in Part 2)
    // ... pod creation logic ...

    // 2. Lock down the network
    if err := f.createNetworkPolicy(ctx, id); err != nil {
         // If policy fails, DELETE the pod to fail safe.
         // Do not leave an un-isolated pod running.
         f.Teardown(ctx, id)
         return fmt.Errorf("failed to secure network: %w", err)
    }

    return nil
}
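
The mirror-image step belongs in Teardown: delete the policy along with the pod. A sketch of a hypothetical deleteNetworkPolicy helper (the name is ours, not part of the runtime from Part 2); treating NotFound as success keeps teardown idempotent:

go

package kubernetes

import (
    "context"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// deleteNetworkPolicy removes the per-job policy. Hypothetical helper:
// call it from Teardown alongside the pod deletion.
func (f *FirecrackerRuntime) deleteNetworkPolicy(ctx context.Context, id string) error {
    err := f.clientset.NetworkingV1().NetworkPolicies(f.namespace).Delete(
        ctx,
        "isolate-"+id,
        metav1.DeleteOptions{},
    )
    // Treat "already gone" as success so Teardown stays idempotent.
    if apierrors.IsNotFound(err) {
        return nil
    }
    return err
}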

Advanced: Egress Gateways

Sometimes, you need all agent traffic to leave from a Static IP (e.g., to whitelist your agents on a partner's API).

In standard K8s, all pods masquerade as the Node IP. To fix this, we can use a Cilium Egress Gateway by creating a CiliumEgressGatewayPolicy CRD.

manifests/egress-policy.yaml

apiVersion: cilium.io/v2
kind: CiliumEgressGatewayPolicy
metadata:
  name: agent-static-exit
spec:
  selectors:
    - podSelector:
        matchLabels:
          managed-by: orchestrator
          security: high
  destinationCIDRs:
    - "0.0.0.0/0"
  egressGateway:
    nodeSelector:
      matchLabels:
        egress-gateway: "true"
    egressIP: "203.0.113.50" # Your static Elastic IP

The policy selects our secure agent pods by label and routes their external traffic through a dedicated gateway node with a stable IP. This IP must be pre-configured on the gateway node's network interface.
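
If you want the Go control plane to own this CRD too, the dynamic client is the natural fit, since Cilium's types aren't in the standard clientset. A sketch (the applyEgressPolicy name is ours, and we assume a dynamic.Interface is already wired up); the policy is cluster-scoped, so there's no namespace:

go

package kubernetes

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
)

// egressGVR identifies the Cilium CRD; the resource is the lowercase
// plural of the kind.
var egressGVR = schema.GroupVersionResource{
    Group:    "cilium.io",
    Version:  "v2",
    Resource: "ciliumegressgatewaypolicies",
}

func applyEgressPolicy(ctx context.Context, dyn dynamic.Interface) error {
    policy := &unstructured.Unstructured{Object: map[string]interface{}{
        "apiVersion": "cilium.io/v2",
        "kind":       "CiliumEgressGatewayPolicy",
        "metadata":   map[string]interface{}{"name": "agent-static-exit"},
        "spec": map[string]interface{}{
            "selectors": []interface{}{
                map[string]interface{}{
                    "podSelector": map[string]interface{}{
                        "matchLabels": map[string]interface{}{
                            "managed-by": "orchestrator",
                            "security":   "high",
                        },
                    },
                },
            },
            "destinationCIDRs": []interface{}{"0.0.0.0/0"},
            "egressGateway": map[string]interface{}{
                "nodeSelector": map[string]interface{}{
                    "matchLabels": map[string]interface{}{"egress-gateway": "true"},
                },
                "egressIP": "203.0.113.50",
            },
        },
    }}

    // Cluster-scoped resource: no .Namespace(...) call before Create.
    _, err := dyn.Resource(egressGVR).Create(ctx, policy, metav1.CreateOptions{})
    return err
}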

Conclusion

We have now built a "Fort Knox" for AI agents:

  1. Compute: Hardware isolation via Firecracker (Part 2).
  2. Network: L3/L4 network isolation via K8s NetworkPolicies (Part 3).
  3. Control: Orchestrated via a unified Go interface (Part 1).

This architecture allows you to sleep at night while thousands of user-defined agents crawl the web and execute code on your platform.