This page documents what the AWS Terraform modules provision and how the modules wire the resulting deployment together.

Platform layers

LangSmith on AWS deploys in two stages with one optional add-on. The infrastructure stage provisions the cloud foundation. The application stage installs the LangSmith Helm chart. The LangSmith Deployment add-on is opt-in and adds the host-backend, listener, and operator services for managing LangGraph applications from the UI.

LangSmith on AWS service layout:

LangSmith Deployment add-on  (enable_langsmith_deployments = true)
  host-backend, listener, operator
  Per deployed graph: api-server, queue, redis, postgres (operator-managed)
  Requires: KEDA (installed alongside infrastructure via k8s-bootstrap)

LangSmith application  (deploy_langsmith = true)
  backend, frontend, playground, queue, ace-backend, clickhouse
  Storage: RDS PostgreSQL (metadata) + S3 (trace blobs via VPC endpoint)
  Ingress: AWS ALB | NGINX | Envoy Gateway | Istio

AWS infrastructure
  VPC + private/public subnets + single NAT gateway
  EKS cluster + managed node group + cluster autoscaler
  RDS PostgreSQL (private subnets)
  ElastiCache Redis (private subnets)
  S3 bucket + VPC Gateway Endpoint (no public route)
  ALB controller + EBS CSI driver + metrics server
  k8s-bootstrap: KEDA, ESO, optional Envoy Gateway
  Optional: Network Firewall, WAF, CloudTrail, ALB access logs

Component to storage mapping

| Component | Storage backend | Access method |
| --- | --- | --- |
| backend | RDS PostgreSQL | Private subnet, security group |
| backend | S3 bucket | IRSA + VPC Gateway Endpoint |
| clickhouse | EBS volume (GP3, EKS PVC) | Local |
| redis | ElastiCache or in-cluster | Private subnet, security group |
| LGP operator | RDS PostgreSQL (shared) | Private subnet, security group |

Application core services

These pods run on every deployment. All write logs and metrics; the busier components (backend, queue, ingest-queue) scale horizontally.

| Service | Purpose | Port | HPA | IRSA | Depends on |
| --- | --- | --- | --- | --- | --- |
| langsmith-frontend | React UI | 3000 | 1 to 10 | No | backend, platform-backend |
| langsmith-backend | Main API (traces, runs, projects, API keys, feedback) | 1984 | 3 to 10 | Yes (S3) | Postgres, Redis, ClickHouse, S3 |
| langsmith-platform-backend | Org and user management, auth, billing, settings | 1986 | 1 to 10 | Yes (S3) | Postgres, Redis, S3 |
| langsmith-playground | LLM prompt playground UI | 3001 | 1 to 10 | No | backend |
| langsmith-queue | Trace ingestion worker (Redis to ClickHouse + S3) | | 3 to 10 + KEDA | Yes | Redis, ClickHouse, S3 |
| langsmith-ingest-queue | Dedicated high-throughput ingestion worker | | 3 to 10 + KEDA | Yes | Redis, S3 |
| langsmith-ace-backend | Async compute (dataset runs, evaluations, background jobs) | | 1 to 5 | No | Postgres, Redis |
| langsmith-clickhouse | Columnar store (trace spans, run metadata, eval results) | | StatefulSet, single replica | No | EBS GP3 PVC |

In-cluster ClickHouse is dev/POC only (single pod, no replication, no backups). For production use LangChain Managed ClickHouse or a self-managed external cluster.
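
To confirm that the core services are running after an install, a small wrapper around kubectl rollout status can walk the list. This is a minimal sketch, not part of the modules. It assumes each service in the table runs as a Deployment of the same name in the langsmith namespace, and that ClickHouse is a StatefulSet named langsmith-clickhouse; adjust the names if your release differs.

```shell
# Sketch: wait for each core LangSmith workload to finish rolling out.
# Assumption: workload names match the service names in the table above.
KUBECTL="${KUBECTL:-kubectl}"   # set KUBECTL=echo for a dry run

check_core_services() {
  local name
  for name in langsmith-frontend langsmith-backend langsmith-platform-backend \
              langsmith-playground langsmith-queue langsmith-ingest-queue \
              langsmith-ace-backend; do
    # Stop at the first workload that does not become ready in time.
    "$KUBECTL" rollout status "deployment/$name" -n langsmith --timeout=60s || return 1
  done
  # ClickHouse runs as a single-replica StatefulSet, so it is checked separately.
  "$KUBECTL" rollout status statefulset/langsmith-clickhouse -n langsmith --timeout=60s
}
```

Calling check_core_services returns non-zero as soon as one workload fails to become ready within the timeout.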

One-time jobs

The Helm chart runs three jobs at install and upgrade time:

| Job | Purpose |
| --- | --- |
| langsmith-backend-migrations | PostgreSQL schema migrations |
| langsmith-backend-ch-migrations | ClickHouse schema migrations |
| langsmith-backend-auth-bootstrap | Creates the initial org and admin account from initial_org_admin_password in langsmith-config |

LangSmith Deployment add-on

When enable_langsmith_deployments = true, three additional services are installed and a LangGraphPlatform CRD is registered. Each deployment the user creates in the LangSmith UI produces a Kubernetes Deployment in the langsmith namespace, managed by the operator.

| Service | Purpose |
| --- | --- |
| langsmith-host-backend | LangGraph control plane API. Manages deployment lifecycle, serves deployment metadata. IRSA for S3 access. |
| langsmith-listener | Watches host-backend for deployment state changes, creates and updates LangGraphPlatform CRDs. IRSA for S3 access. |
| langsmith-operator | Kubernetes operator. Reconciles LangGraphPlatform CRDs, creates and deletes Deployments and Services for each agent. |

AWS managed services

When postgres_source = "external" and redis_source = "external" (the recommended production setting), Terraform provisions the following AWS managed services:

RDS PostgreSQL

  • Default size: db.t3.large, private subnets, port 5432.
  • Holds orgs, users, projects, API keys, settings.
  • Secret flow: SSM /langsmith/{base_name}/postgres-password → ESO → langsmith-config.

ElastiCache Redis

  • Default size: cache.m5.large, private subnets, TLS port 6379.
  • Trace ingestion queue, pub/sub, short-lived cache.
  • Secret flow: SSM /langsmith/{base_name}/redis-auth-token → ESO → langsmith-config.

S3 bucket

  • Trace payloads: large inputs and outputs, attachments.
  • IRSA via langsmith_irsa_role (no static keys). VPC Gateway Endpoint, no public internet.
  • Prefixes: ttl_s/ (short TTL) and ttl_l/ (long TTL).
  • Always required. Disabling blob storage breaks the cluster on large payloads.

SSM Parameter Store

  • Centralized secret store for all LangSmith secrets.
  • Flow: sourcing infra/scripts/setup-env.sh in your shell writes the secrets to SSM. The ESO ClusterSecretStore reads them and projects a langsmith-config Kubernetes Secret that the Helm chart mounts via config.existingSecretName.
  • Prefix: /langsmith/{name_prefix}-{environment}/.
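
The secret flow above can be traced by hand from the parameter name. The helper below is a sketch that builds a parameter path from the documented prefix; it assumes {base_name} expands to name_prefix-environment, and the values acme and prod in the comments are hypothetical.

```shell
# Sketch: build the SSM parameter name for a LangSmith secret.
# Assumption: {base_name} is "<name_prefix>-<environment>".
langsmith_ssm_path() {
  local name_prefix="$1" environment="$2" key="$3"
  printf '/langsmith/%s-%s/%s\n' "$name_prefix" "$environment" "$key"
}

# Read the value from SSM (needs AWS credentials with ssm:GetParameter):
#   aws ssm get-parameter --with-decryption \
#     --name "$(langsmith_ssm_path acme prod postgres-password)" \
#     --query Parameter.Value --output text
#
# Confirm that ESO projected the Kubernetes Secret (shows key names and sizes,
# not values). Assumption: the Secret lives in the langsmith namespace.
#   kubectl describe secret langsmith-config -n langsmith
```

If the SSM read succeeds but the Secret is missing a key, the gap is most likely on the ESO side rather than in Terraform.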

Cluster infrastructure

The k8s-bootstrap Terraform module installs the cluster-level services that LangSmith depends on:

| Service | Namespace | IRSA | Purpose |
| --- | --- | --- | --- |
| aws-load-balancer-controller | kube-system | Yes | Provisions the AWS ALB from Kubernetes Ingress objects. Deleting the Ingress deprovisions the ALB and assigns a new DNS name on recreate, which breaks DNS records and OIDC redirect URIs. |
| cluster-autoscaler | kube-system | Yes | Scales EC2 node groups based on pod scheduling pressure. |
| ebs-csi-driver | kube-system | Yes | Provisions EBS volumes for PersistentVolumeClaims (used by ClickHouse). |
| KEDA | keda | No | Kubernetes Event-driven Autoscaling. Scales queue and ingest-queue on Redis queue depth. Required for the LangSmith Deployment add-on. |
| cert-manager | cert-manager | Optional (Route 53 IRSA when letsencrypt) | Automates TLS certificate issuance. Installed always; active for Let’s Encrypt only. |
| External Secrets Operator | external-secrets | Yes | Syncs SSM parameters into the langsmith-config Kubernetes Secret. |

IRSA roles

IRSA replaces static credentials. The EKS cluster’s OIDC issuer is the trust anchor; service accounts in langsmith and kube-system are annotated with role ARNs and pods receive temporary credentials via the EKS token webhook.

| Role | Defined in | Used by | Permissions |
| --- | --- | --- | --- |
| langsmith_irsa_role | modules/eks | backend, platform-backend, queue, ingest-queue, host-backend, listener | s3:GetObject, s3:PutObject, s3:DeleteObject, s3:ListBucket on the LangSmith bucket |
| aws_iam_role.eso | aws/infra/main.tf | ESO controller | ssm:GetParameter, ssm:GetParameters on /langsmith/* |
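
One way to check the IRSA wiring is to read the role annotation straight off a service account. The sketch below uses the standard eks.amazonaws.com/role-arn annotation; the service account name in the usage comment is hypothetical, since the names the chart creates are not listed on this page.

```shell
# Sketch: print the IAM role ARN annotated on a service account.
# Prints nothing if the annotation is absent.
KUBECTL="${KUBECTL:-kubectl}"   # set KUBECTL=echo for a dry run

irsa_role_arn() {
  local service_account="$1" namespace="${2:-langsmith}"
  # Dots inside the annotation key must be escaped in the jsonpath expression.
  "$KUBECTL" get serviceaccount "$service_account" -n "$namespace" \
    -o 'jsonpath={.metadata.annotations.eks\.amazonaws\.com/role-arn}'
}

# Usage (hypothetical service account name):
#   irsa_role_arn langsmith-backend
```

An empty result for a service that needs S3 access suggests the pod would fall back to the node role or have no credentials at all.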

Network topology

Default — ALB ingress

Internet
  → AWS Application Load Balancer (port 80 or 443, TLS via ACM or Let's Encrypt)
    → EKS Cluster (private subnets)
      • kube-system: aws-load-balancer-controller, cluster-autoscaler, ebs-csi-driver, keda
      • langsmith:   backend, frontend, playground, queue, clickhouse
                     redis (in-cluster) OR ElastiCache (private subnet)
                     RDS PostgreSQL (private subnet)
                     S3 bucket (VPC Gateway Endpoint, no public route)

Envoy Gateway — opt-in

Internet
  → AWS Network Load Balancer (NLB, ACM TLS termination at 443)
    → envoy-gateway-system: Envoy proxy (GatewayClass: eg, Gateway: langsmith-gateway)
      → langsmith namespace:        backend, frontend, playground, queue, clickhouse, ...
      → langsmith-agents namespace (optional dataplane): langgraph-dataplane listener + operator + agent pods
         (HTTPRoute attaches to shared langsmith-gateway via allowedRoutes: All)

Egress path with Network Firewall

When create_firewall = true, all outbound internet traffic from private subnets is inspected before reaching the NAT gateway:

EKS pods / RDS / ElastiCache (private subnets)
  → AWS Network Firewall (TLS SNI + HTTP Host inspection)
     ALLOWLIST: firewall_allowed_fqdns (default: beacon.langchain.com)
     DROP: all other established connections
  → NAT Gateway (public subnet)
  → Internet

Pod-to-pod, pod-to-RDS, and pod-to-ElastiCache traffic uses the local VPC route and never touches the firewall.

Ingress options

Four mutually exclusive ingress options ship with the modules. The choice determines whether split dataplane (agent pods in a separate namespace) is supported.

| Option | Variable | Split dataplane | Traffic path | When to use |
| --- | --- | --- | --- | --- |
| ALB (AWS LBC) | default | No | ALB → frontend NodePort | Default. Single-namespace deployments, POC, simplest TLS via ACM. |
| NGINX Ingress | enable_nginx_ingress = true | No | ALB → TGB → NGINX controller → frontend ClusterIP | When NGINX is the standard ingress in your organization. |
| Envoy Gateway | enable_envoy_gateway = true | Yes | ALB → TGB → Envoy proxy:10080 → HTTPRoute → services | Cross-namespace HTTPRoute routing. Recommended for split dataplane on new AWS deployments. |
| Istio | enable_istio_gateway = true | Yes | ALB → TGB → istio-ingressgateway:80 → VirtualService → services | Clusters with Istio already installed, or when an mTLS mesh is required. |
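
Because the options are mutually exclusive, a pre-apply check can catch a tfvars file that enables two of them. The following is an illustrative sketch only: the function names and the option labels (alb, nginx, envoy-gateway, istio) are hypothetical, and the modules may enforce this differently.

```shell
# Sketch: resolve the ingress option from the three enable_* flags.
# Arguments: enable_nginx_ingress enable_envoy_gateway enable_istio_gateway
select_ingress() {
  local nginx="${1:-false}" envoy="${2:-false}" istio="${3:-false}"
  local enabled=0 choice="alb"   # ALB is the default when no flag is set
  if [ "$nginx" = "true" ]; then enabled=$((enabled + 1)); choice="nginx"; fi
  if [ "$envoy" = "true" ]; then enabled=$((enabled + 1)); choice="envoy-gateway"; fi
  if [ "$istio" = "true" ]; then enabled=$((enabled + 1)); choice="istio"; fi
  if [ "$enabled" -gt 1 ]; then
    echo "error: ingress options are mutually exclusive" >&2
    return 1
  fi
  echo "$choice"
}

# Succeeds only for the options the table marks as split-dataplane capable.
supports_split_dataplane() {
  case "$1" in
    envoy-gateway|istio) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, select_ingress false true false prints envoy-gateway, and supports_split_dataplane envoy-gateway succeeds.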

Why ALB cannot support split dataplane

Standard Kubernetes Ingress is namespace-scoped. The ALB controller routes only to services in the same namespace as the Ingress resource. Agent pods in langsmith-agents are invisible to an Ingress in langsmith. Envoy Gateway and Istio both support cross-namespace routing via the Kubernetes Gateway API.

ALB plus Envoy Gateway (chained)

When the existing ALB already provides SSO (Okta or Cognito OIDC), WAF, and TLS, Envoy Gateway slots in behind it instead of replacing it:

Internet
  → ALB (unchanged: WAF, SSO, TLS, DNS)
    → Envoy Gateway NLB (internal-scheme, auto-provisioned by k8s-bootstrap)
       → HTTPRoute → langsmith namespace        (control plane)
       → HTTPRoute → langsmith-agents namespace (split dataplane)

The only change from the default ALB path is retargeting the ALB target group to the Envoy NLB. See helm/values/examples/langsmith-values-ingress-envoy-gateway.yaml in the modules repo for the values overlay.

TLS and DNS

The tls_certificate_source variable controls the certificate strategy:

| Mode | Behavior | Compatible gateways |
| --- | --- | --- |
| none | HTTP only, no certificate | Any |
| acm | HTTPS:443 with HTTP→HTTPS redirect. ACM certificate, auto-provisioned or BYO. | ALB, NGINX |
| letsencrypt | HTTPS via cert-manager + Let’s Encrypt DNS-01 (Route 53 IRSA) | Istio, Envoy |
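
The compatibility column can be encoded as a lookup and run before terraform apply. This is a sketch under assumptions: the function name and the gateway labels (alb, nginx, envoy-gateway, istio) are hypothetical and are not variables defined by the modules.

```shell
# Sketch: succeed when a tls_certificate_source mode is listed as compatible
# with the chosen gateway in the table above.
tls_mode_compatible() {
  local mode="$1" gateway="$2"
  case "$mode" in
    none) return 0 ;;   # HTTP only works with any gateway
    acm) case "$gateway" in alb|nginx) return 0 ;; esac ;;
    letsencrypt) case "$gateway" in istio|envoy-gateway) return 0 ;; esac ;;
  esac
  return 1              # unknown mode or unsupported pairing
}

# Usage:
#   tls_mode_compatible acm istio || echo "acm is not listed for Istio" >&2
```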

Why ACM versus cert-manager

ACM certificates are non-exportable. AWS attaches them directly to the ALB, which makes ACM the right choice when TLS terminates at the ALB. ACM cannot be used when TLS terminates inside the cluster (Istio Gateway, Envoy Gateway) because those gateways require the certificate material as a Kubernetes Secret. cert-manager handles in-cluster TLS for Istio and Envoy. The letsencrypt value is a reference implementation: it installs cert-manager and a Let’s Encrypt ACME ClusterIssuer. In production, swap the ClusterIssuer for any cert-manager-compatible issuer.

| Issuer | When to use |
| --- | --- |
| Let’s Encrypt (default) | Public domain, internet access, free |
| ACM Private CA (aws-privateca-issuer) | AWS-native, air-gap friendly, private domains, paid |
| Venafi (cert-manager-venafi) | Enterprise PKI, regulated environments |
| HashiCorp Vault (cert-manager-vault) | Self-hosted PKI |
| DigiCert, Sectigo, others | ACME or custom issuer plugins |

The Terraform module provisions the cert-manager IRSA role and Route 53 permissions. Only the ClusterIssuer manifest changes between issuers.

Auto-provisioned DNS

When langsmith_domain is set and acm_certificate_arn is empty, Terraform activates the dns module which creates:
  • A Route 53 hosted zone for the domain.
  • An ACM certificate with DNS validation records.
  • A Route 53 alias record pointing the domain to the ALB.
Staged deploy pattern: Set langsmith_domain with tls_certificate_source = "none" first. Terraform creates the hosted zone and certificate without blocking on validation. Delegate the NS records at your registrar, then flip to tls_certificate_source = "acm" in a later apply. Terraform blocks until the certificate validates and wires it into the HTTPS listener.
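
The staged pattern can be sketched as two applies with a manual delegation step in between. The variables are passed with -var here only to keep the example compact; setting them in a tfvars file is equally valid. The domain in the usage comment is hypothetical.

```shell
# Sketch: two-stage apply for auto-provisioned DNS and ACM.
TERRAFORM="${TERRAFORM:-terraform}"   # set TERRAFORM=echo for a dry run

# Stage 1: create the hosted zone and certificate without waiting on validation.
stage_one_apply() {
  "$TERRAFORM" apply -var "langsmith_domain=$1" -var "tls_certificate_source=none"
}

# Between stages: delegate the zone's NS records at the registrar, then confirm
# that public DNS returns them, for example with: dig +short NS <domain>

# Stage 2: switch to ACM. This apply blocks until the certificate validates.
stage_two_apply() {
  "$TERRAFORM" apply -var "langsmith_domain=$1" -var "tls_certificate_source=acm"
}

# Usage (hypothetical domain):
#   stage_one_apply langsmith.example.com
#   stage_two_apply langsmith.example.com
```

Running stage two before the NS delegation has propagated leaves the apply waiting on validation, so checking the delegation first saves time.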

Bring your own certificate

Set acm_certificate_arn directly to skip the dns module. For in-cluster gateways, create a Kubernetes TLS Secret manually and reference it in the Gateway or VirtualService.
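
For the in-cluster case, the TLS Secret can be created with kubectl create secret tls. The wrapper below is a minimal sketch; the secret name, namespace, and file names in the usage comment are hypothetical, and the Secret generally needs to live where the Gateway resource can reference it.

```shell
# Sketch: create a kubernetes.io/tls Secret from a PEM certificate chain and key.
KUBECTL="${KUBECTL:-kubectl}"   # set KUBECTL=echo for a dry run

create_gateway_tls_secret() {
  local name="$1" namespace="$2" cert_file="$3" key_file="$4"
  "$KUBECTL" create secret tls "$name" -n "$namespace" \
    --cert="$cert_file" --key="$key_file"
}

# Usage (all four values are placeholders):
#   create_gateway_tls_secret langsmith-tls <gateway-namespace> fullchain.pem privkey.pem
```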

Module dependency graph

vpc ─► firewall (optional, create_firewall = true)

├─► eks ─► k8s-bootstrap (KEDA, ESO, Envoy Gateway [opt-in])
│            └─► cert-manager (Let's Encrypt DNS-01 via Route 53 IRSA)

├─► postgres    (RDS, private subnets from VPC)
├─► redis       (ElastiCache, private subnets from VPC)
├─► storage     (S3 bucket + VPC Gateway Endpoint)
├─► alb         (pre-provisioned ALB, public subnets)
│     └─► alb_access_logs (S3 bucket for access logs, opt-in)
├─► dns         (Route 53 zone + ACM cert, optional)
├─► bastion     (jump host for private EKS access, optional)
├─► cloudtrail  (audit logging, optional)
├─► waf         (WAF ACL on ALB, optional)
└─► firewall    (Network Firewall egress filter, optional)
       all ─► langsmith (root module)

Opt-in security modules

| Module | Variable | Default | Purpose |
| --- | --- | --- | --- |
| Network Firewall | create_firewall | false | FQDN-based egress filtering. Allows only domains in firewall_allowed_fqdns (TLS SNI + HTTP Host). Requires create_vpc = true. Cost ≈ $0.40/hr/endpoint + $0.065/GB processed. |
| ALB access logs | alb_access_logs_enabled | false | Traffic analysis and compliance |
| CloudTrail | create_cloudtrail | false | API call logging. Skip if an organization trail already exists. |
| WAF | create_waf | false | WAFv2 Web ACL: OWASP Top 10, IP reputation, known bad inputs |
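
The Network Firewall cost figure can be turned into a rough monthly estimate. The sketch below uses the approximate rates from the table and assumes a 730-hour month; the endpoint count depends on how many firewall endpoints your VPC layout creates, which this page does not specify.

```shell
# Sketch: estimate monthly Network Firewall cost in USD.
# Arguments: endpoints, GB processed per month, hours per month (default 730).
# Rates are the approximate figures quoted above, not a price quote.
firewall_monthly_cost() {
  local endpoints="$1" gb_processed="$2" hours="${3:-730}"
  awk -v e="$endpoints" -v g="$gb_processed" -v h="$hours" \
    'BEGIN { printf "%.2f\n", e * h * 0.40 + g * 0.065 }'
}

# One endpoint processing 100 GB in a month:
#   firewall_monthly_cost 1 100     # prints 298.50
```

At low traffic volumes the hourly endpoint charge dominates: in the example, 292.00 of the 298.50 total is endpoint time.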

Default resource sizes

| Resource | Default | vCPU | Memory |
| --- | --- | --- | --- |
| EKS node | m5.4xlarge | 16 | 64 GB |
| RDS PostgreSQL | db.t3.large | 2 | 8 GB |
| ElastiCache Redis | cache.m6g.xlarge | 4 | 13.07 GB |
| RDS storage | 10 GB | | |

For production sizing recommendations, see the scaling guide and the AWS deployment guide.

Validated behaviors and known constraints

These constraints were validated during the April 2026 gateway permutation test run.

| # | Area | Constraint or fix |
| --- | --- | --- |
| 1 | ACM wildcard SANs | langchain.com has 0 issue "amazon.com" CAA but not 0 issuewild "amazon.com". Wildcard SANs fail with CAA_ERROR. The dns module requests only the apex domain. |
| 2 | In-cluster Redis | The LangSmith Helm chart deploys Redis without requirepass. The k8s_bootstrap module writes redis://langsmith-redis:6379. Do not add an auth token unless you also configure the Helm chart Redis values. |
| 3 | name_prefix length | Maximum 15 characters. Names like dz-nginx-tst (12 characters) are valid. |
| 4 | Istio port | Istio 1.23+ ingressgateway listens on port 80 via NET_BIND_SERVICE, not port 8080. ALB TGB health check and security group rules must target port 80. |
| 5 | NGINX TGB port | NGINX ingress-nginx controller pods listen on port 80. The TargetGroupBinding target type is ip. |
| 6 | Envoy proxy port | Envoy proxy pods listen on port 10080 (not 80) when running as non-root. The TGB servicePort must be 10080. |
| 7 | Destroy order | Always run terraform destroy first and let Terraform handle namespace and Helm release lifecycle. Pre-deleting namespaces causes the helm_release resource to time out because Helm cannot uninstall cleanly into a terminating namespace. |
| 8 | Stuck terminating namespaces | KEDA’s stale external.metrics.k8s.io/v1beta1 API group causes NamespaceDeletionDiscoveryFailure. Fix: kubectl delete apiservice v1beta1.external.metrics.k8s.io before re-running terraform destroy. |
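
Two of these constraints (rows 3 through 6) are easy to check before an apply. The helpers below are an illustrative sketch; the function names and gateway labels are hypothetical, and the modules may already validate the same limits.

```shell
# Sketch: reject a name_prefix longer than the 15-character maximum (row 3).
validate_name_prefix() {
  local prefix="$1"
  if [ "${#prefix}" -gt 15 ]; then
    echo "name_prefix '$prefix' is ${#prefix} characters; maximum is 15" >&2
    return 1
  fi
}

# Sketch: port the TargetGroupBinding must target per gateway (rows 4 to 6).
tgb_target_port() {
  case "$1" in
    nginx|istio) echo 80 ;;
    envoy-gateway) echo 10080 ;;   # Envoy runs as non-root, so not port 80
    *) echo "unknown gateway: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   validate_name_prefix dz-nginx-tst && echo "prefix ok"
#   tgb_target_port envoy-gateway
```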

Verification commands

# EKS cluster status
aws eks describe-cluster --name <cluster-name> --query "cluster.status"

# Node health
kubectl get nodes -o wide

# ALB status
kubectl get ingress -n langsmith

# RDS status
aws rds describe-db-instances \
  --query "DBInstances[?DBInstanceIdentifier=='<db-id>'].DBInstanceStatus"

# ElastiCache status
aws elasticache describe-replication-groups \
  --query "ReplicationGroups[?ReplicationGroupId=='<group-id>'].Status"

# S3 access from a pod (via VPC endpoint).
# --command replaces the image entrypoint (aws), so the command line runs as written.
kubectl run s3-test --rm -it --image=amazon/aws-cli -n langsmith --command -- \
  aws s3 ls s3://<bucket-name>