Provision the Azure cloud foundation and install LangSmith with the public Terraform modules at github.com/langchain-ai/terraform/tree/main/modules/azure. Plan for 40 to 50 minutes end to end on a clean subscription. The deployment runs in two stages: infrastructure (Terraform provisions AKS, Postgres, Redis, Blob Storage, Key Vault, cert-manager, KEDA, ingress) and application (Helm installs the LangSmith chart against the cluster). Three add-ons (LangSmith Deployment, Agent Builder, Insights and Polly) are enabled with flags and a redeploy.
Prerequisites
Required tools
| Tool | Version | Purpose |
|---|---|---|
| Azure CLI (az) | 2.50 | Authenticate, query Azure resources, manage AKS credentials |
| Terraform | 1.5 | Run the infrastructure modules |
| kubectl | 1.28 | Inspect the AKS cluster |
| Helm | 3.12 | Install and manage the LangSmith chart |
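The versions in the table can be checked before starting. The following is a minimal sketch that assumes the listed versions are minimums (the table does not say so explicitly) and that `sort -V` from GNU coreutils is available. The helper names `version_ge` and `check_tool` are hypothetical and not part of the Terraform modules.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not part of the LangSmith Terraform modules.

# Succeeds when version $1 is greater than or equal to version $2.
# Uses `sort -V` (version sort) to find the lower of the two.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Compares an installed version against a floor and reports the result.
# Arguments: tool name, installed version, minimum version.
check_tool() {
  local name="$1" found="$2" floor="$3"
  if version_ge "$found" "$floor"; then
    echo "ok: $name $found (>= $floor)"
  else
    echo "too old: $name $found (< $floor)"
    return 1
  fi
}
```

For example, `check_tool helm 3.14.0 3.12` reports the tool as acceptable, while `check_tool helm 3.9.0 3.12` returns a non-zero status. Pass in whatever version string each tool reports on your machine, with any leading `v` removed.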
Required Azure RBAC
The identity running Terraform needs the following roles on the subscription:
| Role | Purpose |
|---|---|
| Contributor | Create and manage all Azure resources |
| User Access Administrator | Create role assignments for Key Vault, Blob, cert-manager managed identities |
Owner includes both. Contributor alone is insufficient because role assignments require User Access Administrator.
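The role rule above (Owner alone, or Contributor together with User Access Administrator) can be expressed as a small check. This is a sketch only: `has_required_roles` is a hypothetical helper that reads role names, one per line, from standard input. The commented `az` invocation shows one way to produce that list; `PRINCIPAL_ID` and `SUBSCRIPTION_ID` are placeholders you would supply, and assignments inherited through groups or management groups may not appear in its default output.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not part of the LangSmith Terraform modules.

# Reads role names (one per line) on stdin. Succeeds when the list contains
# Owner, or both Contributor and User Access Administrator.
has_required_roles() {
  local roles
  roles="$(cat)"
  if printf '%s\n' "$roles" | grep -qxF 'Owner'; then
    return 0
  fi
  printf '%s\n' "$roles" | grep -qxF 'Contributor' &&
    printf '%s\n' "$roles" | grep -qxF 'User Access Administrator'
}

# Possible usage (placeholders must be filled in):
#   az role assignment list \
#     --assignee "$PRINCIPAL_ID" \
#     --scope "/subscriptions/$SUBSCRIPTION_ID" \
#     --query "[].roleDefinitionName" -o tsv | has_required_roles
```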
Authenticate
dns_label (Azure subdomain, no DNS setup needed) or a custom langsmith_domain.
Rapid path
For the fastest path from zero to a running LangSmith instance:
Provision infrastructure
Provisioning the Azure cloud foundation takes 15 to 20 minutes on a clean subscription. Do not interrupt the apply.
What gets provisioned
| Resource | Type | Purpose |
|---|---|---|
| Resource Group | azurerm_resource_group | Container for all resources |
| Virtual Network | azurerm_virtual_network | Isolated network (10.0.0.0/17) |
| AKS Cluster | azurerm_kubernetes_cluster | Kubernetes, all workloads run here |
| Ingress Controller | Helm | External load balancer + TLS termination (nginx by default) |
| PostgreSQL Flexible Server | azurerm_postgresql_flexible_server | Org config, run metadata (external tier) |
| Redis Cache Premium | azurerm_redis_cache | Trace ingestion queue, pub/sub (external tier) |
| Blob Storage | azurerm_storage_account | Raw trace objects, always required |
| Managed Identity | azurerm_user_assigned_identity | Workload Identity for pod-to-Blob auth |
| Azure Key Vault | azurerm_key_vault | Stores all LangSmith secrets |
| cert-manager | Helm | Automated TLS certificate management |
| KEDA | Helm | Event-driven autoscaling for workers |
Clone and configure
modules/azure/. Run make help for the full target list.
Generate terraform.tfvars with the interactive wizard:
Blob Storage is always required, regardless of tier. Trace payloads must go to Azure Blob, never to ClickHouse.
Bootstrap secrets
setup-env.sh writes infra/secrets.auto.tfvars (gitignored, chmod 600). Terraform picks this file up automatically, no shell exports needed.
- First run: prompts for PostgreSQL password, LangSmith license key, admin password, and admin email. Generates api_key_salt, jwt_secret, and four Fernet encryption keys locally.
- Subsequent runs: reads everything silently from Azure Key Vault.
Preflight
terraform.tfvars and secrets.auto.tfvars presence, and terraform/kubectl/helm on PATH.
Apply
Skip make plan on a fresh deploy. kubernetes_manifest resources require a live cluster API during plan, which does not exist yet. make apply handles resource ordering in three internal stages: Azure resources → AKS → Kubernetes bootstrap.
Cluster credentials and Kubernetes Secrets
After make apply completes, get cluster credentials and push secrets into the cluster:
make k8s-secrets reads 8 secrets from Key Vault and creates or updates langsmith-config-secret. Safe to re-run; uses --dry-run=client | kubectl apply to update in place.
Verify infrastructure
Deploy LangSmith
Two deployment paths are supported. Pick one.
| Path | Command | When to use |
|---|---|---|
| Helm path (default) | make init-values && make deploy | Interactive output, kubeconfig refresh, preflight checks. Best for first-time deploys and day-2 re-deploys. |
| Terraform path | make init-app && make apply-app | Helm release + Kubernetes Secrets + Workload Identity SA managed in Terraform state. Best for GitOps and CI/CD pipelines. |
Helm path (recommended)
Generate Helm values
make init-values reads terraform output and terraform.tfvars and generates helm/values/values-overrides.yaml with all fields populated:
- config.hostname, your FQDN (from dns_label or langsmith_domain).
- config.initialOrgAdminEmail, the first org admin account.
- config.existingSecretName: langsmith-config-secret, secrets reference.
- config.blobStorage, storage account name + container + Workload Identity client ID.
- Workload Identity annotations for 5 ServiceAccounts (backend, platform-backend, queue, ingest-queue, host-backend).
- Ingress + TLS block (cert-manager annotation, TLS secret name).
- Postgres and Redis external secret references (when postgres_source = "external" / redis_source = "external").
helm/values/examples/ into helm/values/.
The admin email is read from langsmith_admin_email in terraform.tfvars (set during make setup-env) and written into values-overrides.yaml automatically. No manual editing needed.
Deploy
make deploy handles:
- Validates values-overrides.yaml exists.
- Refreshes kubeconfig via az aks get-credentials.
- Annotates the LoadBalancer service with service.beta.kubernetes.io/azure-dns-label-name, required for Azure to assign the DNS label to the public IP.
- Creates the letsencrypt-prod cert-manager ClusterIssuer if tls_certificate_source = "letsencrypt" (idempotent).
- Runs preflight checks (tools, cluster connectivity, Helm repo).
- Verifies langsmith-config-secret exists; auto-creates from Key Vault if missing.
- Builds and logs the values chain.
- Auto-recovers any stuck pending-upgrade Helm release before proceeding.
- Runs helm upgrade --install langsmith langchain/langsmith --timeout 20m.
- Waits for core deployments to roll out.
- Annotates the langsmith-ksa ServiceAccount with the Workload Identity client ID.
- Prints the access URL and login credentials location.
Why --timeout 20m? The langsmith-backend-auth-bootstrap Job runs DB migrations and org initialization as a post-install hook. This takes up to 5 minutes on first install. Without a long timeout, Helm may report failure even though the install eventually succeeds.
Terraform path
Use this path when you want the Helm release, Kubernetes Secrets, and Workload Identity ServiceAccount managed in Terraform state.
app/terraform.tfvars:
Verify the deployment
https://<HOSTNAME> and log in with the admin email and password from Key Vault:
Values chain
make deploy applies Helm values files in this order (last file wins on conflicts):
helm/values/ are gitignored (generated or contain live secrets). Source templates live in helm/values/examples/ and are copied by make init-values.
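The "last file wins" rule follows from how Helm merges values files: each `-f` flag is applied in order, and later files override earlier ones. The sketch below builds that flag list from an ordered set of paths. `values_args` is a hypothetical helper, and the file names in the usage note are illustrative only; the actual chain and its order are whatever make deploy logs.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not part of the LangSmith Terraform modules.

# Prints "-f <file>" for each argument, preserving the given order.
# Helm applies values files left to right, so the last one wins on conflicts.
values_args() {
  local out="" f
  for f in "$@"; do
    out="${out:+$out }-f $f"
  done
  printf '%s\n' "$out"
}
```

Called as `values_args base.yaml values-overrides.yaml`, it prints `-f base.yaml -f values-overrides.yaml`, so a key set in both files takes its value from values-overrides.yaml. The output is meant for unquoted expansion into a helm command line, so this sketch does not handle paths that contain spaces.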
Day-2 operations
Enable add-ons
Each add-on is gated by a flag in infra/terraform.tfvars. Set the flag, re-run make init-values to regenerate values, then re-run make deploy.
LangSmith Deployment
Enables LangSmith Deployment, which lets you deploy and manage LangGraph graphs as API servers directly from the LangSmith UI. Adds three new pods.
| Pod | Role | Workload Identity |
|---|---|---|
| langsmith-host-backend | LangSmith Deployment control plane API. Manages deployment lifecycle, stores state in shared PostgreSQL. | Yes |
| langsmith-listener | Watches host-backend, creates and updates LangGraphPlatform CRDs in Kubernetes. | Yes |
| langsmith-operator | Reconciles CRDs. Creates per-deployment Deployments, StatefulSets, and Services. | No |
Scale the node pool first
Before enabling, bump default_node_pool_min_count to at least 5. The operator spawns agent deployment pods on demand and needs node headroom:
Apply, regenerate values, deploy
make init-values appends the LangSmith Deployment add-on overlay (langsmith-values-agent-deploys.yaml) to the values chain. It automatically injects:
Verify
langsmith-host-backend, langsmith-listener, and langsmith-operator should all be Running. Total pod count: ~20 Running + 3 Completed jobs.
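To compare the live cluster against these approximate counts, the pod table can be tallied by status. This is a sketch: `count_status` is a hypothetical helper that reads the default `kubectl get pods` table from standard input and counts rows whose STATUS column matches. The `langsmith` namespace in the usage comment is an assumption; substitute the namespace your deployment uses.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not part of the LangSmith Terraform modules.

# Reads `kubectl get pods` table output on stdin and prints the number of
# pods whose STATUS (third column) equals the first argument.
count_status() {
  awk -v want="$1" 'NR > 1 && $3 == want { n++ } END { print n + 0 }'
}

# Possible usage (namespace is an assumption):
#   kubectl get pods -n langsmith | count_status Running
#   kubectl get pods -n langsmith | count_status Completed
```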
KEDA is already installed alongside infrastructure. With enable_deployments = true, the operator creates KEDA ScaledObject resources for each agent deployment’s worker queue. Worker pods scale down to zero when idle and scale up based on Redis queue depth.
Agent Builder
Provides visual AI-assisted creation and management of LangGraph agents from the LangSmith UI. No terraform apply needed; just make init-values && make deploy.
Prerequisite: LangSmith Deployment enabled (enable_deployments = true). Enabling Agent Builder without it causes a preflight error.
| Pod | Type | Role |
|---|---|---|
| langsmith-agent-builder-tool-server | Static | MCP tool execution server, code/file editing tools for the AI |
| langsmith-agent-builder-trigger-server | Static | Webhook receiver and scheduled trigger engine |
| langsmith-agent-bootstrap | Job (Completed) | Registers the bundled Agent Builder agent through the operator, runs once |
| agent-builder-<hash> + queue + redis + lg-<hash>-0 | Dynamic (operator-managed) | Agent Builder deployment, created by the operator when the bootstrap Job runs |
make init-values appends the Agent Builder add-on overlay (langsmith-values-agent-builder.yaml) to the values chain. The overlay enables the Agent Builder UI and supporting services, sets backend.agentBootstrap: true (the post-install job that registers Agent Builder as a LangSmith Deployment and creates the required ConfigMap), and sets conservative agent worker pod resources (1 CPU / 1 Gi) instead of the chart’s default 4 CPU / 8 Gi.
Verify:
After make deploy, an Agent Builder section appears in the LangSmith UI navigation.
Both langsmith-agent-builder-tool-server and langsmith-agent-builder-trigger-server need Workload Identity to access Azure Blob Storage. Their federated credentials are pre-registered in modules/k8s-cluster/main.tf; no additional setup is needed.
Insights and Polly
Two features, both of which require LangSmith Deployment. They are independent of each other; enable either one without the other.
- Insights: AI-powered trace analytics (Clio). Surfaces patterns and anomalies in LangSmith traces. Clio deploys as a dynamic LangGraph deployment through the operator on first UI invocation. Adds no new static pods.
- Polly: AI-powered evaluation and monitoring agent. Runs as a dynamic LangGraph deployment. Sets resource limits for the Polly worker (2 CPU / 4 Gi request, 4 CPU / 8 Gi limit, scales 1 to 5 replicas).
No terraform apply needed; just make init-values && make deploy.
make init-values appends the add-on overlays based on clickhouse_source in terraform.tfvars:
- clickhouse_source = "in-cluster", generates a minimal overlay (config.insights.enabled: true only). The Helm chart manages ClickHouse internally.
- clickhouse_source = "external", generates a full overlay with clickhouse.external.enabled: true and a langsmith-clickhouse secret reference. Create this secret with the ClickHouse host and credentials before deploying.
Add-on summary
| Phase | New pods | Total ~running |
|---|---|---|
| Base install | Core LangSmith (backend, frontend, queue, ingest-queue, clickhouse, etc.) | ~17 |
| LangSmith Deployment | host-backend, listener, operator | ~20 |
| Agent Builder | tool-server, trigger-server, bootstrap Job + 4 dynamic Agent Builder pods | ~26 |
| Insights and Polly | No new static pods (Clio + Polly appear dynamically on first use) | ~22 at rest |
Ingress controllers
Set ingress_controller in terraform.tfvars before make apply. For the full TLS compatibility matrix, see INGRESS_CONTROLLERS.md in the Azure module repo.
| Value | What Terraform installs | Best for |
|---|---|---|
| nginx (default) | ingress-nginx Helm chart with Azure LB | Standard deployments. Simplest setup. |
| istio-addon | AKS Service Mesh add-on (Azure-managed Istio) | Azure-managed Istio mesh, multi-dataplane, mTLS. |
| istio | istio-base + istiod + istio-ingressgateway | Self-managed Istio. Full mesh and sidecar injection. |
| agic | Azure Application Gateway v2 + AGIC Helm chart | Enterprise Azure, native L7 WAF, HTTP-only or dns01 + custom domain. |
| envoy-gateway | gateway-helm OCI chart, Kubernetes Gateway API | Gateway API native, modern alternative to Ingress. |
DNS and TLS
dns_label gives you a free Azure subdomain, <label>.<region>.cloudapp.azure.com, with no domain registration or DNS zone needed. deploy.sh annotates the correct LoadBalancer service automatically.
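The hostname pattern above can be assembled ahead of time, for example to know what to put in a browser or a DNS lookup before the deploy finishes. A minimal sketch follows; `azure_fqdn` is a hypothetical helper, and the label and region in the usage note are placeholder values.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not part of the LangSmith Terraform modules.

# Builds the Azure-assigned hostname <label>.<region>.cloudapp.azure.com.
# Arguments: dns_label value, Azure region name (for example "eastus").
azure_fqdn() {
  printf '%s.%s.cloudapp.azure.com\n' "$1" "$2"
}
```

With a dns_label of `mylangsmith` in region `eastus`, `azure_fqdn mylangsmith eastus` prints `mylangsmith.eastus.cloudapp.azure.com`.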
Quickstart default (HTTP, zero setup):
- make apply creates the Azure DNS zone and outputs 4 nameservers.
- At your registrar, add NS records for the subdomain pointing to those 4 nameservers.
- Verify: dig NS langsmith.mycompany.com @8.8.8.8.
- make deploy issues the cert via DNS-01 automatically (Workload Identity writes the TXT record to Azure DNS).
- Get the LB IP, add ingress_ip = "<ip>" to terraform.tfvars, then make apply (creates the A record).
make status shows exactly what NS and A records to add at each stage.
Why NS records, not CNAME: cert-manager must write TXT records to the zone to prove ownership. That requires Azure DNS to be authoritative for the subdomain, and NS delegation grants that authority. A CNAME only aliases traffic and does not transfer DNS authority; the DNS-01 challenge will fail.
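Because DNS-01 depends on the delegation being in place, it can help to confirm that a public resolver returns the Azure nameservers before running make deploy. The sketch below compares expected nameservers against `dig +short NS` output. `ns_delegated` is a hypothetical helper, and the nameserver names in the usage comment are placeholders; use the four that make apply outputs.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not part of the LangSmith Terraform modules.

# Succeeds when every expected nameserver (arguments) appears in the
# `dig +short NS` output read from stdin. Case and trailing dots are ignored.
ns_delegated() {
  local actual ns
  actual="$(tr 'A-Z' 'a-z' | sed 's/\.$//')"
  for ns in "$@"; do
    ns="$(printf '%s\n' "$ns" | tr 'A-Z' 'a-z' | sed 's/\.$//')"
    printf '%s\n' "$actual" | grep -qxF "$ns" || return 1
  done
}

# Possible usage (nameserver names are placeholders):
#   dig +short NS langsmith.mycompany.com @8.8.8.8 |
#     ns_delegated ns1-01.azure-dns.com ns2-01.azure-dns.net
```

A non-zero exit status means at least one expected nameserver is missing from the answer, which usually indicates the NS records at the registrar are absent or have not yet propagated.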
Next steps
- Reference the Azure variables and the quick reference.
- Review the Azure architecture for module structure, traffic flow, and Workload Identity.
- When something breaks, check the Azure troubleshooting guide.
- Enable agent deployment in the UI with LangSmith Deployment.

