This page documents what the Azure Terraform modules provision and how the modules wire the resulting deployment together.

Platform layers

LangSmith on Azure deploys in stages. Each stage adds a capability layer on top of the previous one. All layers share the same AKS cluster and langsmith namespace.

Figure: LangSmith on Azure service layout

| Stage | Layer | What it adds |
| --- | --- | --- |
| Infrastructure | Azure infrastructure | VNet, AKS, Postgres, Redis, Blob, Key Vault, cert-manager, KEDA, ingress controller |
| Application | LangSmith base | frontend, backend, platform-backend, queue, ingest-queue, ace-backend, clickhouse, playground |
| LangSmith Deployment add-on | LangSmith Deployment | host-backend, listener, operator + per-deployment pods |
| Agent Builder add-on | Agent Builder | agent-builder-tool-server, agent-builder-trigger-server + deep-agent LGP |
| Insights + Polly add-on | Insights + Polly | Clio analytics (ClickHouse-backed), Polly eval agent (operator-managed, dynamic) |

Application deployment paths

| Path | How | When to use |
| --- | --- | --- |
| Helm path | make init-values && make deploy | Default. Shell script, interactive, reads TF outputs dynamically. Best for first deploys and day-2 re-deploys. |
| Terraform path | make init-app && make apply-app | Declarative. Kubernetes Secrets + langsmith-ksa SA + Helm release in Terraform state. Best for GitOps and CI/CD pipelines. |

The Terraform path uses the app/ module. make init-app calls app/scripts/pull-infra-outputs.sh to read all infra outputs and write them into app/infra.auto.tfvars.json.

Deployment tiers

Light deploy (all in-cluster)

AKS Cluster
├── langsmith namespace
│   ├── frontend, backend, platform-backend, playground, queue, ace-backend
│   ├── clickhouse (in-cluster pod)
│   ├── postgres   (in-cluster pod)
│   └── redis      (in-cluster pod)
├── ingress-nginx (Azure Load Balancer → NGINX)
└── cert-manager  (Let's Encrypt TLS)

Azure
├── Azure Blob Storage  (trace payloads — always external)
└── Azure Key Vault     (secrets)
Set in terraform.tfvars:
postgres_source   = "in-cluster"
redis_source      = "in-cluster"
clickhouse_source = "in-cluster"
For the full all-in-cluster walkthrough (Front Door TLS, all-in-cluster DBs), see BUILDING_LIGHT_LANGSMITH.md in the Azure module repo.

Production (external managed services)

AKS Cluster
├── langsmith namespace
│   ├── frontend, backend, platform-backend, playground, queue, ingest-queue, ace-backend
│   └── clickhouse (in-cluster — use LangChain Managed for production scale)
└── ingress-nginx + cert-manager

Azure Managed Services
├── Azure DB for PostgreSQL Flexible Server (private VNet)
├── Azure Cache for Redis Premium (private VNet)
├── Azure Blob Storage (Workload Identity — no static keys)
└── Azure Key Vault

Networking

Light deploy

langsmith-vnet<identifier>
└── subnet-0    (AKS nodes only)
    No Postgres/Redis subnets — chart-managed pods handle both

Production

langsmith-vnet<identifier>
├── subnet-0              (AKS nodes)
├── subnet-postgres       (Azure DB for PostgreSQL Flexible Server)
└── subnet-redis          (Azure Cache for Redis Premium)
All subnets are private. Postgres and Redis are accessible only from within the VNet via private DNS resolution. No public endpoints.
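The relationship between the source settings and the subnet layout can be sketched in a few lines of Python. This is an illustration only, not code from the Terraform modules: the function name planned_subnets is hypothetical, and the behavior for a mixed configuration (one service external, the other in-cluster) is an assumption, since only the two tiers above are described.

```python
from __future__ import annotations

VALID_SOURCES = {"in-cluster", "external"}


def planned_subnets(postgres_source: str, redis_source: str) -> list[str]:
    """Return the subnet names expected in the VNet for the given settings.

    Hypothetical helper: it mirrors the two layouts described above and
    assumes each managed service gets its subnet independently.
    """
    for name, value in (("postgres_source", postgres_source),
                        ("redis_source", redis_source)):
        if value not in VALID_SOURCES:
            raise ValueError(f"{name} must be one of {sorted(VALID_SOURCES)}, got {value!r}")

    subnets = ["subnet-0"]  # AKS nodes, present in both tiers
    if postgres_source == "external":
        subnets.append("subnet-postgres")
    if redis_source == "external":
        subnets.append("subnet-redis")
    return subnets


if __name__ == "__main__":
    print(planned_subnets("in-cluster", "in-cluster"))
    print(planned_subnets("external", "external"))
```

With both settings at "in-cluster" the sketch yields only subnet-0, matching the light deploy; with both at "external" it yields the three production subnets.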

Application core services

| Service | Purpose | Port | HPA | Workload Identity |
| --- | --- | --- | --- | --- |
| langsmith-frontend | React UI | 3000 | 1 to 10 | No |
| langsmith-backend | Main API (traces, runs, projects, API keys, feedback) | 1984 | 3 to 10 | Yes (Blob) |
| langsmith-platform-backend | Org and user management, auth, billing, settings | 1986 | 1 to 10 | Yes (Blob) |
| langsmith-playground | LLM prompt playground UI | 3001 | 1 to 10 | No |
| langsmith-queue | Trace ingestion worker (Redis → ClickHouse + Blob) |  | 3 to 10 + KEDA | Yes |
| langsmith-ingest-queue | Dedicated high-throughput ingestion worker |  | 3 to 10 + KEDA | Yes |
| langsmith-ace-backend | Async compute (dataset runs, evaluations, background jobs) |  | 1 to 5 | No |
| langsmith-clickhouse | Columnar store (trace spans, run metadata, eval results) |  | StatefulSet, single replica, 500Gi PVC | No |

In-cluster ClickHouse is dev/POC only (single pod, no replication, no backups). For production use LangChain Managed ClickHouse or a self-managed external cluster.

One-time jobs

| Job | Purpose |
| --- | --- |
| langsmith-backend-migrations | PostgreSQL schema migrations |
| langsmith-backend-ch-migrations | ClickHouse schema migrations |
| langsmith-backend-auth-bootstrap | Creates the initial org and admin account from initial_org_admin_password in langsmith-config-secret |

LangSmith Deployment add-on

| Service | Purpose | Workload Identity |
| --- | --- | --- |
| langsmith-host-backend | LangGraph control plane API. Manages deployment lifecycle, serves deployment metadata. | Yes |
| langsmith-listener | Watches host-backend for state changes, creates and updates LangGraphPlatform CRDs. | Yes |
| langsmith-operator | Kubernetes operator. Azure-specific: injects azure.workload.identity/use: "true" + langsmith-ksa so every agent pod accesses Blob Storage via Workload Identity. | No |

Agent Builder add-on

| Pod | Type | Role | Workload Identity |
| --- | --- | --- | --- |
| langsmith-agent-builder-tool-server | Static | MCP tool execution server | Yes |
| langsmith-agent-builder-trigger-server | Static | Webhook receiver and scheduled trigger engine | Yes |
| langsmith-agent-bootstrap | Job | Registers the bundled Agent Builder agent |  |
| agent-builder-<hash> + queue + redis + lg-<hash>-0 | Dynamic | Agent Builder deployment, operator-managed | Inherited |

Insights and Polly add-on

Insights/Clio: No static pods. It deploys lazily as a dynamic LangGraph deployment via the operator on first UI invocation, and reads insights_encryption_key from langsmith-config-secret. Never rotate this key: rotating it permanently breaks existing Insights data.

Polly: Runs as a dynamic LangGraph deployment. Resources: 2 CPU / 4 Gi request, 4 CPU / 8 Gi limit; scales from 1 to 5 replicas. Reads polly_encryption_key from langsmith-config-secret. The same rotation warning as for Insights applies.

Azure managed services

When postgres_source = "external" and redis_source = "external" (the recommended production setting), Terraform provisions:

Azure DB for PostgreSQL Flexible Server

  • Holds orgs, users, projects, API keys, settings.
  • PostgreSQL ≥ 14 required (Azure Flexible Server defaults to 16).
  • Extensions enabled automatically by the postgres module: btree_gin, btree_gist, pgcrypto, citext, ltree, pg_trgm.
  • Private VNet only (subnet-postgres), SSL port 5432.
  • Secret: langsmith-postgres-secret, created by the k8s-bootstrap Terraform module.

Azure Cache for Redis Premium

  • Trace ingestion queue, pub/sub, short-lived cache.
  • Redis ≥ 5 required (Premium tier defaults to Redis 6).
  • Each LangSmith installation must use its own dedicated Redis. Shared instances cause deployment tasks to route incorrectly.
  • Private VNet only (subnet-redis), TLS port 6380.
  • Secret: langsmith-redis-secret, created by the k8s-bootstrap Terraform module.

Azure Blob Storage

  • Trace payloads: large inputs and outputs, attachments.
  • Workload Identity (no static keys) via the k8s-app-identity Managed Identity.
  • Always required. Disabling blob storage breaks the cluster on large payloads.
  • Prefixes: ttl_s/ (14-day TTL), ttl_l/ (400-day TTL).
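As a rough illustration of the two prefixes, the sketch below maps a blob name to the time at which its retention window would end. The source does not say how the TTLs are enforced, so this is arithmetic only; the helper name expiry_for and the example blob names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Retention windows taken from the prefixes listed above.
TTL_BY_PREFIX = {
    "ttl_s/": timedelta(days=14),
    "ttl_l/": timedelta(days=400),
}


def expiry_for(blob_name, written_at):
    """Return when the retention window ends, or None for an unknown prefix."""
    for prefix, ttl in TTL_BY_PREFIX.items():
        if blob_name.startswith(prefix):
            return written_at + ttl
    return None


if __name__ == "__main__":
    written = datetime(2025, 1, 1, tzinfo=timezone.utc)
    print(expiry_for("ttl_s/example-payload", written))  # 14 days later
    print(expiry_for("ttl_l/example-payload", written))  # 400 days later
```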

Azure Key Vault

  • Centralized secret store for all LangSmith secrets.
  • Secret flow: az keyvault secret show → kubectl create secret generic langsmith-config-secret.

Workload Identity

Azure AD token exchange happens via the AKS OIDC issuer. Pods access Blob Storage without static keys.
AKS OIDC issuer
  → Federated credential on Azure Managed Identity (one per Kubernetes ServiceAccount)
  → Kubernetes ServiceAccount annotated with azure.workload.identity/client-id
  → Pod labeled with azure.workload.identity/use: "true"
  → Azure AD issues a short-lived token — no storage keys in any Secret or env var
Workload Identity is centralized in modules/k8s-cluster/ alongside the managed identity and OIDC issuer, which avoids circular dependencies and simplifies adding new ServiceAccounts.

Which pods need Workload Identity

Every pod that reads blob storage env vars must have:
  1. A federated credential registered in Terraform (modules/k8s-cluster/main.tf).
  2. The azure.workload.identity/use: "true" label on the Deployment.
  3. The azure.workload.identity/client-id annotation on the ServiceAccount.

| Pod | Stage | Needs WI |
| --- | --- | --- |
| langsmith-backend | Application | Yes |
| langsmith-platform-backend | Application | Yes |
| langsmith-queue | Application | Yes |
| langsmith-ingest-queue | Application | Yes |
| langsmith-host-backend | LangSmith Deployment add-on | Yes |
| langsmith-listener | LangSmith Deployment add-on | Yes |
| langsmith-agent-builder-tool-server | Agent Builder add-on | Yes |
| langsmith-agent-builder-trigger-server | Agent Builder add-on | Yes |
| langsmith-frontend | Application | No |
| langsmith-playground | Application | No |
| langsmith-ace-backend | Application | No |
| langsmith-clickhouse | Application | No |
| langsmith-operator | LangSmith Deployment add-on | No |

All federated credentials are registered in modules/k8s-cluster/main.tf under service_accounts_for_workload_identity. Adding a new pod that accesses blob storage requires adding its ServiceAccount name to that list and running terraform apply -target=module.aks.
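A quick way to reason about this list is to compare it against the pods that need Workload Identity. The Python sketch below does that and also builds the subject string a federated credential is matched against, which for Kubernetes ServiceAccount tokens has the form system:serviceaccount:<namespace>:<name>. It is a hypothetical helper, not part of the modules, and it assumes each ServiceAccount is named after its pod; check the actual names in your cluster.

```python
NAMESPACE = "langsmith"

# Pods marked "Yes" in the table above. Assumption: the ServiceAccount
# name equals the pod name shown in the table.
NEEDS_WORKLOAD_IDENTITY = [
    "langsmith-backend",
    "langsmith-platform-backend",
    "langsmith-queue",
    "langsmith-ingest-queue",
    "langsmith-host-backend",
    "langsmith-listener",
    "langsmith-agent-builder-tool-server",
    "langsmith-agent-builder-trigger-server",
]


def federated_subject(service_account, namespace=NAMESPACE):
    """Subject claim that a federated credential must match for this ServiceAccount."""
    return f"system:serviceaccount:{namespace}:{service_account}"


def missing_credentials(registered):
    """Return pods that need Workload Identity but are absent from the registered list."""
    known = set(registered)
    return [pod for pod in NEEDS_WORKLOAD_IDENTITY if pod not in known]


if __name__ == "__main__":
    registered = NEEDS_WORKLOAD_IDENTITY[:-1]  # simulate one forgotten entry
    for pod in missing_credentials(registered):
        print("missing federated credential for", federated_subject(pod))
```

Any pod reported as missing would be expected to fail the token exchange at startup.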

What breaks without it

panic: blob-storage health-check failed: get container properties failed:
DefaultAzureCredential: failed to acquire a token.
WorkloadIdentityCredential authentication failed.
  AADSTS700213: No matching federated identity record found for presented assertion subject
The pod panics on startup — the ServiceAccount has no registered federated credential so Azure AD rejects the token exchange.

Secret flow

Infrastructure stage

  ./setup-env.sh   (read-only against Key Vault — never writes to KV directly)
    First run:  prompts for postgres password, license key, admin password.
                Generates api_key_salt, jwt_secret, Fernet keys locally.
                Key Vault does not exist yet → writes to local dot-files + secrets.auto.tfvars.
    Subsequent: Key Vault exists → reads all secrets from KV → writes to secrets.auto.tfvars.
                No prompts, no generation, no KV writes.
    Output:     secrets.auto.tfvars  (gitignored, chmod 600)
                Terraform picks this up automatically — no shell session coupling.

  terraform apply
    Reads:  terraform.tfvars (non-sensitive config)
            secrets.auto.tfvars (sensitive values — sole input for KV secret creation)
    Creates: Azure Key Vault + all secrets as KV secrets (Terraform is the sole KV writer)

Application stage

  ./setup-env.sh   (re-run on any machine to refresh secrets.auto.tfvars from Key Vault)

  kubectl create secret generic langsmith-config-secret
    Reads:  Key Vault secrets + Terraform outputs (postgres/redis URLs, blob account)
    Writes: K8s secrets — langsmith-config-secret, langsmith-postgres-secret,
                          langsmith-redis-secret

  helm upgrade --install langsmith ...
    Chart reads config.existingSecretName = "langsmith-config-secret".
    No secrets inline in any YAML file.
Key rule: secrets.auto.tfvars is never committed. It is regenerated from Key Vault on any machine by running ./setup-env.sh. Terraform is the sole writer to Key Vault; setup-env.sh only reads from it after the first apply.
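The first-run versus subsequent-run behavior described above can be sketched as a small decision function. This is not the contents of setup-env.sh: the function names, the dictionary keys for prompted values, and the choice of which secrets are Fernet keys (insights_encryption_key and polly_encryption_key are used here) are assumptions for illustration. The one firm detail is the Fernet key format, which is 32 random bytes encoded as URL-safe base64.

```python
from __future__ import annotations

import base64
import secrets


def generate_fernet_key() -> str:
    """Return 32 random bytes as URL-safe base64, the format Fernet keys use."""
    return base64.urlsafe_b64encode(secrets.token_bytes(32)).decode("ascii")


def resolve_secrets(key_vault_secrets: dict[str, str] | None,
                    prompted: dict[str, str]) -> dict[str, str]:
    """Decide where secret values come from.

    key_vault_secrets is None when Key Vault does not exist yet (first run).
    Otherwise its contents are used as-is: no prompts, no generation.
    """
    if key_vault_secrets is not None:
        return dict(key_vault_secrets)

    generated = {
        "api_key_salt": secrets.token_urlsafe(32),   # format is an assumption
        "jwt_secret": secrets.token_urlsafe(32),     # format is an assumption
        "insights_encryption_key": generate_fernet_key(),
        "polly_encryption_key": generate_fernet_key(),
    }
    return {**prompted, **generated}


if __name__ == "__main__":
    first_run = resolve_secrets(None, {"postgres_password": "example-only"})
    print(sorted(first_run))
```

In both cases the returned values would then be written to secrets.auto.tfvars, with Terraform remaining the only writer to Key Vault.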

Ingress options

| Controller | Variable | DNS label support | Notes |
| --- | --- | --- | --- |
| nginx (default) | ingress_controller = "nginx" | Yes | NGINX via Helm, standard Kubernetes Ingress. |
| istio-addon | ingress_controller = "istio-addon" | Yes | AKS managed Istio service mesh. Use istio_addon_revision to pin revision. |
| istio | ingress_controller = "istio" | Yes | Self-managed Istio via Helm. Full control over revision and config. |
| agic | ingress_controller = "agic" | Yes | Azure Application Gateway v2 + AGIC Helm chart. Native L7 WAF. HTTP-only or dns01 + custom domain. |
| envoy-gateway | ingress_controller = "envoy-gateway" | Yes | Gateway API native. Uses envoyproxy/gateway-helm. |
| none | ingress_controller = "none" |  | Bring your own ingress. |

Azure Public IP DNS labels (dns_label) work with all controllers. deploy.sh applies the service.beta.kubernetes.io/azure-dns-label-name annotation to the correct LoadBalancer service based on the chosen controller. For the full TLS compatibility matrix and per-controller setup, see INGRESS_CONTROLLERS.md in the Azure module repo.

Resource sizing

Four sizing profiles are available.

| Profile | Use case | Set via |
| --- | --- | --- |
| minimum | Cost parking, CI smoke tests, single-user demos | sizing_profile = "minimum" in terraform.tfvars |
| dev | Developer use, integration tests, POCs | sizing_profile = "dev" |
| production | Real traffic: multi-replica + HPA | sizing_profile = "production" (recommended) |
| production-large | ~50 users, ~1000 traces/sec | sizing_profile = "production-large" |

AKS node pools

| Pool | VM Size | vCPU | RAM | Min | Max | Purpose |
| --- | --- | --- | --- | --- | --- | --- |
| default | Standard_D8s_v3 | 8 | 32 GB | 3 | 10 | Core LangSmith, system pods |
| large | Standard_D16s_v3 | 16 | 64 GB | 0 | 2 | ClickHouse (in-cluster), LGP agent pods |

ClickHouse (when in-cluster) requests 2 to 4 CPU and 8 to 15 GB RAM depending on profile. With LangChain Managed ClickHouse, the large pool is only needed for LGP operator-spawned agent pods.
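To get a feel for what these pool limits mean in aggregate, the sketch below adds up raw node capacity at the minimum and maximum node counts from the table. It is plain arithmetic under a simplifying assumption: it ignores the share of each node that Kubernetes reserves for system use, so allocatable capacity will be lower. The NodePool class and capacity function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodePool:
    name: str
    vcpu: int        # vCPUs per node
    ram_gb: int      # RAM per node, in GB
    min_nodes: int
    max_nodes: int


# Values copied from the node pool table above.
POOLS = [
    NodePool("default", vcpu=8, ram_gb=32, min_nodes=3, max_nodes=10),
    NodePool("large", vcpu=16, ram_gb=64, min_nodes=0, max_nodes=2),
]


def capacity(pools, at_max):
    """Return (total vCPU, total RAM in GB) at either min or max node counts."""
    counts = [(p, p.max_nodes if at_max else p.min_nodes) for p in pools]
    total_vcpu = sum(p.vcpu * n for p, n in counts)
    total_ram = sum(p.ram_gb * n for p, n in counts)
    return total_vcpu, total_ram


if __name__ == "__main__":
    print("min:", capacity(POOLS, at_max=False))  # (24, 96)
    print("max:", capacity(POOLS, at_max=True))   # (112, 448)
```

At minimum scale only the three default nodes run (24 vCPU, 96 GB); fully scaled out, both pools together provide 112 vCPU and 448 GB of raw capacity.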

Optional modules

Each module is count-controlled (0 disabled, 1 enabled). Enable any combination; the core deployment (Passes 1 to 5) works without them.

| Module | Variable | Use case |
| --- | --- | --- |
| waf | create_waf = true | Azure WAF policy (OWASP 3.2 + bot protection). Attach to Application Gateway. |
| diagnostics | create_diagnostics = true | Log Analytics workspace + diagnostic settings for AKS, Key Vault, Blob. Recommended for production observability. |
| bastion | create_bastion = true | Azure Bastion (Standard tier). Browser-based SSH to node VMs without a public IP. |
| dns | create_dns_zone = true | Azure DNS zone + A record. Required for DNS-01 cert issuance with a custom domain. |