AWS Terraform architecture - Docs by LangChain

This page documents what the AWS Terraform modules provision and how the modules wire the resulting deployment together.

Platform layers

LangSmith on AWS deploys in two stages with one optional add-on. The infrastructure stage provisions the cloud foundation. The application stage installs the LangSmith Helm chart. The LangSmith Deployment add-on is opt-in and adds the host-backend, listener, and operator services for managing LangGraph applications from the UI.

LangSmith Deployment add-on  (enable_langsmith_deployments = true)
  host-backend, listener, operator
  Per deployed graph: api-server, queue, redis, postgres (operator-managed)
  Requires: KEDA (installed alongside infrastructure via k8s-bootstrap)

LangSmith application  (deploy_langsmith = true)
  backend, frontend, playground, queue, ace-backend, clickhouse
  Storage: RDS PostgreSQL (metadata) + S3 (trace blobs via VPC endpoint)
  Ingress: AWS ALB | NGINX | Envoy Gateway | Istio

AWS infrastructure
  VPC + private/public subnets + single NAT gateway
  EKS cluster + managed node group + cluster autoscaler
  RDS PostgreSQL (private subnets)
  ElastiCache Redis (private subnets)
  S3 bucket + VPC Gateway Endpoint (no public route)
  ALB controller + EBS CSI driver + metrics server
  k8s-bootstrap: KEDA, ESO, optional Envoy Gateway
  Optional: Network Firewall, WAF, CloudTrail, ALB access logs

Component to storage mapping

Component	Storage backend	Access method
`backend`	RDS PostgreSQL	Private subnet, security group
`backend`	S3 bucket	IRSA + VPC Gateway Endpoint
`clickhouse`	EBS volume (GP3, EKS PVC)	Local
`redis`	ElastiCache or in-cluster	Private subnet, security group
LangGraph Platform (LGP) operator	RDS PostgreSQL (shared)	Private subnet, security group

Application core services

These pods run on every deployment. All write logs and metrics; the busier components (backend, queue, ingest-queue) scale horizontally.

Service	Purpose	Port	HPA	IRSA	Depends on
`langsmith-frontend`	React UI	3000	1 to 10	No	`backend`, `platform-backend`
`langsmith-backend`	Main API (traces, runs, projects, API keys, feedback)	1984	3 to 10	Yes (S3)	Postgres, Redis, ClickHouse, S3
`langsmith-platform-backend`	Org and user management, auth, billing, settings	1986	1 to 10	Yes (S3)	Postgres, Redis, S3
`langsmith-playground`	LLM prompt playground UI	3001	1 to 10	No	`backend`
`langsmith-queue`	Trace ingestion worker (Redis to ClickHouse + S3)	—	3 to 10 + KEDA	Yes	Redis, ClickHouse, S3
`langsmith-ingest-queue`	Dedicated high-throughput ingestion worker	—	3 to 10 + KEDA	Yes	Redis, S3
`langsmith-ace-backend`	Async compute (dataset runs, evaluations, background jobs)	—	1 to 5	No	Postgres, Redis
`langsmith-clickhouse`	Columnar store (trace spans, run metadata, eval results)	—	StatefulSet, single replica	No	EBS GP3 PVC

In-cluster ClickHouse is dev/POC only (single pod, no replication, no backups). For production use LangChain Managed ClickHouse or a self-managed external cluster.
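
The scaling settings in the table above can be inspected on a running cluster. The following is a minimal sketch, assuming kubectl points at the EKS cluster and the chart is installed in the langsmith namespace; show_scaling is a hypothetical helper name, not something the modules ship.

```shell
# Sketch: list the objects behind the HPA column of the service table.
# Assumption: the chart runs in the "langsmith" namespace (pass another
# namespace as the first argument if yours differs).
show_scaling() {
  ns="${1:-langsmith}"
  # HorizontalPodAutoscalers: min and max replicas per service
  kubectl get hpa -n "$ns"
  # KEDA ScaledObjects for queue and ingest-queue (requires the KEDA CRDs)
  kubectl get scaledobjects.keda.sh -n "$ns"
  # ClickHouse runs as a StatefulSet backed by an EBS PVC
  kubectl get statefulset,pvc -n "$ns"
}
```

Call it as show_scaling, or show_scaling <namespace> for a non-default namespace.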

One-time jobs

The Helm chart runs three jobs at install and upgrade time:

Job	Purpose
`langsmith-backend-migrations`	PostgreSQL schema migrations
`langsmith-backend-ch-migrations`	ClickHouse schema migrations
`langsmith-backend-auth-bootstrap`	Creates the initial org and admin account from `initial_org_admin_password` in `langsmith-config`
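
One way to confirm the three jobs finished is to wait on each of them in turn. This is a sketch under assumptions: the Job objects live in the langsmith namespace and still exist after they complete (if the chart cleans up finished jobs, kubectl wait reports them as not found). The function name and the 300-second timeout are illustrative.

```shell
# Sketch: wait for each install/upgrade job and show logs on failure.
wait_for_langsmith_jobs() {
  ns="${1:-langsmith}"
  for job in langsmith-backend-migrations \
             langsmith-backend-ch-migrations \
             langsmith-backend-auth-bootstrap; do
    if ! kubectl wait --for=condition=complete "job/${job}" \
         -n "$ns" --timeout=300s; then
      echo "job ${job} did not complete; recent logs:" >&2
      # Logs are best effort: the pod may already be gone.
      kubectl logs "job/${job}" -n "$ns" --tail=50 >&2 || true
      return 1
    fi
  done
}
```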

LangSmith Deployment add-on

When enable_langsmith_deployments = true, three additional services are installed and a LangGraphPlatform CRD is registered. Each deployment the user creates in the LangSmith UI produces a Kubernetes Deployment in the langsmith namespace, managed by the operator.

Service	Purpose
`langsmith-host-backend`	LangGraph control plane API. Manages deployment lifecycle, serves deployment metadata. IRSA for S3 access.
`langsmith-listener`	Watches host-backend for deployment state changes, creates and updates `LangGraphPlatform` CRDs. IRSA for S3 access.
`langsmith-operator`	Kubernetes operator. Reconciles `LangGraphPlatform` CRDs, creates and deletes Deployments and Services for each agent.

AWS managed services

When postgres_source = "external" and redis_source = "external" (the recommended production setting), Terraform provisions the following AWS managed services:

RDS PostgreSQL

Default size: db.t3.large, private subnets, port 5432.
Holds orgs, users, projects, API keys, settings.
Secret flow: SSM /langsmith/{base_name}/postgres-password → ESO → langsmith-config.

ElastiCache Redis

Default size: cache.m6g.xlarge, private subnets, TLS port 6379.
Trace ingestion queue, pub/sub, short-lived cache.
Secret flow: SSM /langsmith/{base_name}/redis-auth-token → ESO → langsmith-config.

S3 bucket

Trace payloads: large inputs and outputs, attachments.
IRSA via langsmith_irsa_role (no static keys). VPC Gateway Endpoint, no public internet.
Prefixes: ttl_s/ (short TTL) and ttl_l/ (long TTL).
Always required. Disabling blob storage breaks the cluster on large payloads.

SSM Parameter Store

Centralized secret store for all LangSmith secrets.
Flow: running `source infra/scripts/setup-env.sh` writes secrets to SSM. The ESO ClusterSecretStore reads them and projects a langsmith-config Kubernetes Secret that the Helm chart mounts via config.existingSecretName.
Prefix: /langsmith/{name_prefix}-{environment}/.
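
The secret flow can be checked end to end from a workstation. The sketch below assumes AWS credentials with ssm:GetParameter on the path, and assumes the ExternalSecret objects live in the langsmith namespace. The helper names are hypothetical, and the first argument stands for the {base_name} placeholder used in the secret flows above.

```shell
# Hypothetical helper: build the SSM path for one LangSmith secret.
ssm_param_path() {
  printf '/langsmith/%s/%s\n' "$1" "$2"
}

# Confirm a parameter exists without printing its value.
ssm_param_exists() {
  aws ssm get-parameter --name "$(ssm_param_path "$1" "$2")" \
    --query "Parameter.Name" --output text
}

# Confirm ESO has projected the Kubernetes Secret the chart mounts.
langsmith_config_synced() {
  kubectl get externalsecrets.external-secrets.io -n langsmith
  kubectl get secret langsmith-config -n langsmith
}
```

For example, ssm_param_exists <base_name> postgres-password followed by langsmith_config_synced covers both ends of the flow.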

Cluster infrastructure

The k8s-bootstrap Terraform module installs the cluster-level services that LangSmith depends on:

Service	Namespace	IRSA	Purpose
`aws-load-balancer-controller`	`kube-system`	Yes	Provisions the AWS ALB from Kubernetes Ingress objects. Deleting the Ingress deprovisions the ALB and assigns a new DNS name on recreate, which breaks DNS records and OIDC redirect URIs.
`cluster-autoscaler`	`kube-system`	Yes	Scales EC2 node groups based on pod scheduling pressure.
`ebs-csi-driver`	`kube-system`	Yes	Provisions EBS volumes for PersistentVolumeClaims (used by ClickHouse).
KEDA	`keda`	No	Kubernetes Event-driven Autoscaling. Scales `queue` and `ingest-queue` on Redis queue depth. Required for the LangSmith Deployment add-on.
cert-manager	`cert-manager`	Optional (Route 53 IRSA when `letsencrypt`)	Automates TLS certificate issuance. Installed always; active for Let’s Encrypt only.
External Secrets Operator	`external-secrets`	Yes	Syncs SSM parameters into the `langsmith-config` Kubernetes Secret.

IRSA roles

IRSA replaces static credentials. The EKS cluster’s OIDC issuer is the trust anchor; service accounts in langsmith and kube-system are annotated with role ARNs and pods receive temporary credentials via the EKS token webhook.

Role	Defined in	Used by	Permissions
`langsmith_irsa_role`	`modules/eks`	`backend`, `platform-backend`, `queue`, `ingest-queue`, host-backend, listener	`s3:GetObject`, `s3:PutObject`, `s3:DeleteObject`, `s3:ListBucket` on the LangSmith bucket
`aws_iam_role.eso`	`aws/infra/main.tf`	ESO controller	`ssm:GetParameter`, `ssm:GetParameters` on `/langsmith/*`
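
To see which service accounts carry a role annotation, and whether a pod actually received web identity settings, something like the following can help. It is a sketch: the eks.amazonaws.com/role-arn annotation key is the standard IRSA annotation, but the Deployment name langsmith-backend and the presence of an env binary in the container are assumptions.

```shell
# List service accounts with their IRSA role ARN (blank when not annotated).
list_irsa_annotations() {
  ns="${1:-langsmith}"
  kubectl get serviceaccounts -n "$ns" \
    -o custom-columns='NAME:.metadata.name,ROLE_ARN:.metadata.annotations.eks\.amazonaws\.com/role-arn'
}

# Show the AWS_* variables injected into a running pod.
# $1 = namespace, $2 = Deployment name (both assumed defaults).
show_irsa_env() {
  kubectl exec "deploy/${2:-langsmith-backend}" -n "${1:-langsmith}" -- env \
    | grep '^AWS_'
}
```

With IRSA working, the second command typically lists AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE.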

Network topology

Default — ALB ingress

Internet
  → AWS Application Load Balancer (port 80 or 443, TLS via ACM or Let's Encrypt)
    → EKS Cluster (private subnets)
      • kube-system: aws-load-balancer-controller, cluster-autoscaler, ebs-csi-driver
      • keda:        keda
      • langsmith:   backend, frontend, playground, queue, clickhouse
                     redis (in-cluster) OR ElastiCache (private subnet)
                     RDS PostgreSQL (private subnet)
                     S3 bucket (VPC Gateway Endpoint, no public route)

Envoy Gateway — opt-in

Internet
  → AWS Network Load Balancer (NLB, ACM TLS termination at 443)
    → envoy-gateway-system: Envoy proxy (GatewayClass: eg, Gateway: langsmith-gateway)
      → langsmith namespace:        backend, frontend, playground, queue, clickhouse, ...
      → langsmith-agents namespace (optional dataplane): langgraph-dataplane listener + operator + agent pods
         (HTTPRoute attaches to shared langsmith-gateway via allowedRoutes: All)

Egress path with Network Firewall

When create_firewall = true, all outbound internet traffic from private subnets is inspected before reaching the NAT gateway:

EKS pods / RDS / ElastiCache (private subnets)
  → AWS Network Firewall (TLS SNI + HTTP Host inspection)
     ALLOWLIST: firewall_allowed_fqdns (default: beacon.langchain.com)
     DROP: all other established connections
  → NAT Gateway (public subnet)
  → Internet

Pod-to-pod, pod-to-RDS, and pod-to-ElastiCache traffic uses the local VPC route and never touches the firewall.
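
A quick way to exercise the allowlist is to issue an HTTPS request from a throwaway pod. This is a rough sketch, not a validated procedure: check_egress and the EGRESS_TEST_IMAGE variable are hypothetical, and the default image must be pullable from inside the cluster, which the firewall itself may prevent unless the image is mirrored to a registry the cluster can reach.

```shell
# Sketch: request https://<host> from a temporary pod in the langsmith namespace.
# Prints the HTTP status code on success; a blocked domain is expected to
# fail or time out instead.
check_egress() {
  host="$1"
  image="${EGRESS_TEST_IMAGE:-curlimages/curl}"   # assumption: any image with curl
  kubectl run "egress-test-$$" --rm -i --restart=Never --quiet \
    --image="$image" -n langsmith --command -- \
    curl -sS -o /dev/null --max-time 10 -w '%{http_code}\n' "https://${host}"
}
```

Run check_egress beacon.langchain.com for an allowlisted domain, then repeat with a domain that is not in firewall_allowed_fqdns to compare.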

Ingress options

Four mutually exclusive ingress options ship with the modules. The choice determines whether split dataplane (agent pods in a separate namespace) is supported.

Option	Variable	Split dataplane	Traffic path	When to use
ALB (AWS LBC)	default	No	`ALB → frontend NodePort`	Default. Single-namespace deployments, POC, simplest TLS via ACM.
NGINX Ingress	`enable_nginx_ingress = true`	No	`ALB → TGB → NGINX controller → frontend ClusterIP`	When NGINX is the standard ingress in your organization.
Envoy Gateway	`enable_envoy_gateway = true`	Yes	`ALB → TGB → Envoy proxy:10080 → HTTPRoute → services`	Cross-namespace HTTPRoute routing. Recommended for split dataplane on new AWS deployments.
Istio	`enable_istio_gateway = true`	Yes	`ALB → TGB → istio-ingressgateway:80 → VirtualService → services`	Clusters with Istio already installed, or when an mTLS mesh is required.
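
The selection rules in this table can be written down as a small pre-flight check. This is only a sketch of the rules as stated here; the modules may enforce them differently, and the function names and short labels (alb, nginx, envoy, istio) are illustrative.

```shell
# Arguments are the values of the three enable_* variables, in this order:
#   $1 enable_nginx_ingress  $2 enable_envoy_gateway  $3 enable_istio_gateway
select_ingress() {
  enabled=""
  count=0
  if [ "$1" = "true" ]; then enabled="nginx"; count=$((count + 1)); fi
  if [ "$2" = "true" ]; then enabled="envoy"; count=$((count + 1)); fi
  if [ "$3" = "true" ]; then enabled="istio"; count=$((count + 1)); fi
  if [ "$count" -gt 1 ]; then
    echo "ingress options are mutually exclusive; ${count} are enabled" >&2
    return 1
  fi
  # No flag set means the default ALB path.
  echo "${enabled:-alb}"
}

# Split dataplane is supported only by Envoy Gateway and Istio.
supports_split_dataplane() {
  case "$1" in
    envoy|istio) return 0 ;;
    *)           return 1 ;;
  esac
}
```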

Why ALB cannot support split dataplane

Standard Kubernetes Ingress is namespace-scoped. The ALB controller routes only to services in the same namespace as the Ingress resource. Agent pods in langsmith-agents are invisible to an Ingress in langsmith. Envoy Gateway and Istio both support cross-namespace routing via the Kubernetes Gateway API.

ALB plus Envoy Gateway (chained)

When the existing ALB already provides SSO (Okta or Cognito OIDC), WAF, and TLS, Envoy Gateway slots in behind it instead of replacing it:

Internet
  → ALB (unchanged: WAF, SSO, TLS, DNS)
    → Envoy Gateway NLB (internal-scheme, auto-provisioned by k8s-bootstrap)
       → HTTPRoute → langsmith namespace        (control plane)
       → HTTPRoute → langsmith-agents namespace (split dataplane)

The only change from the default ALB path is retargeting the ALB target group to the Envoy NLB. See helm/values/examples/langsmith-values-ingress-envoy-gateway.yaml in the modules repo for the values overlay.

TLS and DNS

The tls_certificate_source variable controls the certificate strategy:

Mode	Behavior	Compatible gateways
`none`	HTTP only, no certificate	Any
`acm`	HTTPS:443 with HTTP→HTTPS redirect. ACM certificate, auto-provisioned or BYO.	ALB, NGINX
`letsencrypt`	HTTPS via cert-manager + Let’s Encrypt DNS-01 (Route 53 IRSA)	Istio, Envoy
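
As a sketch, the compatibility column above can be encoded as a lookup and used to reject an unsupported combination before running an apply. The gateway labels are illustrative shorthand, not module inputs, and the function only mirrors this table.

```shell
# $1 = tls_certificate_source value, $2 = gateway label (alb|nginx|istio|envoy)
tls_mode_supported() {
  case "$1:$2" in
    none:alb|none:nginx|none:istio|none:envoy) return 0 ;;  # HTTP only, any gateway
    acm:alb|acm:nginx)                         return 0 ;;  # TLS terminates at the ALB
    letsencrypt:istio|letsencrypt:envoy)       return 0 ;;  # in-cluster TLS via cert-manager
    *)                                         return 1 ;;
  esac
}
```

For example, tls_mode_supported acm istio returns a non-zero status, which matches the reasoning in the next section.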

Why ACM versus cert-manager

ACM certificates are non-exportable. AWS attaches them directly to the ALB, which makes ACM the right choice when TLS terminates at the ALB. ACM cannot be used when TLS terminates inside the cluster (Istio Gateway, Envoy Gateway) because those gateways require the certificate material as a Kubernetes Secret. cert-manager handles in-cluster TLS for Istio and Envoy. The letsencrypt value is a reference implementation: it installs cert-manager and a Let’s Encrypt ACME ClusterIssuer. In production, swap the ClusterIssuer for any cert-manager-compatible issuer.

Issuer	When to use
Let’s Encrypt (default)	Public domain, internet access, free
ACM Private CA (`aws-privateca-issuer`)	AWS-native, air-gap friendly, private domains, paid
Venafi (`cert-manager-venafi`)	Enterprise PKI, regulated environments
HashiCorp Vault (`cert-manager-vault`)	Self-hosted PKI
DigiCert, Sectigo, others	ACME or custom issuer plugins

The Terraform module provisions the cert-manager IRSA role and Route 53 permissions. Only the ClusterIssuer manifest changes between issuers.

Auto-provisioned DNS

When langsmith_domain is set and acm_certificate_arn is empty, Terraform activates the dns module, which creates:

A Route 53 hosted zone for the domain.
An ACM certificate with DNS validation records.
A Route 53 alias record pointing the domain to the ALB.

Staged deploy pattern: Set langsmith_domain with tls_certificate_source = "none" first. Terraform creates the hosted zone and certificate without blocking on validation. Delegate the NS records at your registrar, then flip to tls_certificate_source = "acm" in a later apply. Terraform blocks until the certificate validates and wires it into the HTTPS listener.
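
The staged pattern might look like the following on the command line. This is a sketch that assumes the variables are passed with -var; if your setup keeps them in a tfvars file, change the values there instead. The function names are hypothetical.

```shell
# Stage 1: create the hosted zone and request the certificate without waiting.
stage1_dns() {
  terraform apply -var "langsmith_domain=$1" -var 'tls_certificate_source=none'
}

# Print the name servers to delegate at the registrar.
# Note: list-hosted-zones-by-name returns zones in lexical order starting at
# the given name, so confirm the first result really is your zone.
hosted_zone_nameservers() {
  zone_id="$(aws route53 list-hosted-zones-by-name --dns-name "$1" \
    --query 'HostedZones[0].Id' --output text)"
  aws route53 get-hosted-zone --id "$zone_id" \
    --query 'DelegationSet.NameServers' --output text
}

# Stage 2: after delegation resolves, switch to ACM. This apply blocks until
# the certificate validates.
stage2_acm() {
  terraform apply -var "langsmith_domain=$1" -var 'tls_certificate_source=acm'
}
```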

Bring your own certificate

Set acm_certificate_arn directly to skip the dns module. For in-cluster gateways, create a Kubernetes TLS Secret manually and reference it in the Gateway or VirtualService.
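
For the in-cluster case, the TLS Secret can be created with kubectl. A minimal sketch follows; the secret name, the namespace, and the file paths are placeholders you supply, and the namespace must be one your Gateway or Istio ingress gateway can read secrets from.

```shell
# $1 = secret name, $2 = namespace, $3 = PEM certificate file, $4 = PEM key file
create_gateway_tls_secret() {
  name="$1"; namespace="$2"; cert_file="$3"; key_file="$4"
  kubectl create secret tls "$name" \
    --cert="$cert_file" --key="$key_file" -n "$namespace"
}
```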

Module dependency graph

vpc ─► firewall (optional, create_firewall = true)
│
├─► eks ─► k8s-bootstrap (KEDA, ESO, Envoy Gateway [opt-in])
│            └─► cert-manager (Let's Encrypt DNS-01 via Route 53 IRSA)
│
├─► postgres    (RDS, private subnets from VPC)
├─► redis       (ElastiCache, private subnets from VPC)
├─► storage     (S3 bucket + VPC Gateway Endpoint)
├─► alb         (pre-provisioned ALB, public subnets)
│     └─► alb_access_logs (S3 bucket for access logs, opt-in)
├─► dns         (Route 53 zone + ACM cert, optional)
├─► bastion     (jump host for private EKS access, optional)
├─► cloudtrail  (audit logging, optional)
├─► waf         (WAF ACL on ALB, optional)
└─► firewall    (Network Firewall egress filter, optional)
       all ─► langsmith (root module)

Opt-in security modules

Module	Variable	Default	Purpose
Network Firewall	`create_firewall`	`false`	FQDN-based egress filtering. Allows only domains in `firewall_allowed_fqdns` (TLS SNI + HTTP Host). Requires `create_vpc = true`. Cost ≈ `$0.40/hr/endpoint + $0.065/GB processed`.
ALB access logs	`alb_access_logs_enabled`	`false`	Traffic analysis and compliance
CloudTrail	`create_cloudtrail`	`false`	API call logging. Skip if an organization trail already exists.
WAF	`create_waf`	`false`	WAFv2 Web ACL — OWASP Top 10, IP reputation, known bad inputs

Default resource sizes

Resource	Default	vCPU	Memory
EKS node	`m5.4xlarge`	16	64 GB
RDS PostgreSQL	`db.t3.large`	2	8 GB
ElastiCache Redis	`cache.m6g.xlarge`	4	13.07 GB
RDS storage	10 GB	—	—

For production sizing recommendations, see the scaling guide and the AWS deployment guide.

Validated behaviors and known constraints

These constraints were validated during the April 2026 gateway permutation test run.

#	Area	Constraint or fix
1	ACM wildcard SANs	`langchain.com` has `0 issue "amazon.com"` CAA but not `0 issuewild "amazon.com"`. Wildcard SANs fail with `CAA_ERROR`. The `dns` module requests only the apex domain.
2	In-cluster Redis	The LangSmith Helm chart deploys Redis without `requirepass`. The `k8s_bootstrap` module writes `redis://langsmith-redis:6379`. Do not add an auth token unless you also configure the Helm chart Redis values.
3	`name_prefix` length	Maximum 15 characters. Names like `dz-nginx-tst` (12 characters) are valid.
4	Istio port	Istio 1.23+ ingressgateway listens on port 80 via `NET_BIND_SERVICE`, not port 8080. ALB TGB health check and security group rules must target port 80.
5	NGINX TGB port	NGINX ingress-nginx controller pods listen on port 80. The TargetGroupBinding target type is `ip`.
6	Envoy proxy port	Envoy proxy pods listen on port 10080 (not 80) when running as non-root. The TGB `servicePort` must be 10080.
7	Destroy order	Always run `terraform destroy` first and let Terraform handle namespace and Helm release lifecycle. Pre-deleting namespaces causes the `helm_release` resource to time out because Helm cannot uninstall cleanly into a terminating namespace.
8	Stuck terminating namespaces	KEDA’s stale `external.metrics.k8s.io/v1beta1` API group causes `NamespaceDeletionDiscoveryFailure`. Fix: `kubectl delete apiservice v1beta1.external.metrics.k8s.io` before re-running `terraform destroy`.
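
Constraint 3 is easy to check before an apply. The helper below is a hypothetical sketch that only enforces the 15-character limit stated in the table; it does not validate anything else about the prefix.

```shell
# Reject a name_prefix longer than 15 characters.
check_name_prefix() {
  prefix="$1"
  if [ "${#prefix}" -gt 15 ]; then
    echo "name_prefix '${prefix}' has ${#prefix} characters; the maximum is 15" >&2
    return 1
  fi
  echo "name_prefix '${prefix}' OK (${#prefix} characters)"
}
```

For the example from the table, check_name_prefix dz-nginx-tst reports 12 characters and succeeds.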

Verification commands

# EKS cluster status
aws eks describe-cluster --name <cluster-name> --query "cluster.status"

# Node health
kubectl get nodes -o wide

# ALB status
kubectl get ingress -n langsmith

# RDS status
aws rds describe-db-instances \
  --query "DBInstances[?DBInstanceIdentifier=='<db-id>'].DBInstanceStatus"

# ElastiCache status
aws elasticache describe-replication-groups \
  --query "ReplicationGroups[?ReplicationGroupId=='<group-id>'].Status"

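# S3 access from a pod (via VPC endpoint)
# The amazon/aws-cli image uses `aws` as its entrypoint, so pass only the subcommand.
# The pod runs under the namespace's default service account unless you override it.
kubectl run s3-test --rm -it --restart=Never --image=amazon/aws-cli -n langsmith -- \
  s3 ls s3://<bucket-name>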
