This page documents common issues, fixes, and diagnostic commands for LangSmith deployments provisioned with the Azure Terraform modules.
Infrastructure stage
K8sVersionNotSupported — version is LTS-only
Symptom
eastus. Standard tier clusters must use 1.33+.
Fix: Update kubernetes_version to a version with KubernetesOfficial support:
kubernetes_version pin in terraform.tfvars, then make apply. Existing clusters on 1.32 continue to run; this only blocks new cluster creation.
vCPU quota exceeded
Symptom: autoscaler backoff (pods Pending).
The vCPU quota for standardDSv3Family in eastus is often 10 cores. One Standard_D8s_v3 node uses 8; only 2 remain.
Why max_pods = 30 triggers it: The AKS default is 30 pods per node, but the base LangSmith install alone deploys ~37 pods. The autoscaler tries to add a second node, hits the quota, and enters backoff.
Fix: Set default_node_pool_max_pods = 60 in terraform.tfvars so all pods fit on one node.
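The arithmetic behind this failure can be checked before provisioning. The following is a minimal sketch, not part of the Terraform modules: the function names are hypothetical, the figures (37 pods, 8 vCPUs per Standard_D8s_v3 node, a 10-core quota) are taken from the text above, and it deliberately ignores system pods and per-pod resource requests, which also influence scheduling.

```python
import math


def nodes_needed(pod_count: int, max_pods_per_node: int) -> int:
    """Minimum node count imposed by the per-node pod limit alone."""
    return math.ceil(pod_count / max_pods_per_node)


def vcpus_required(pod_count: int, max_pods_per_node: int, vcpus_per_node: int) -> int:
    """vCPUs consumed once enough nodes exist to hold every pod."""
    return nodes_needed(pod_count, max_pods_per_node) * vcpus_per_node


def fits_quota(pod_count: int, max_pods_per_node: int,
               vcpus_per_node: int, quota_vcpus: int) -> bool:
    """True when the required nodes stay within the regional vCPU quota."""
    return vcpus_required(pod_count, max_pods_per_node, vcpus_per_node) <= quota_vcpus


if __name__ == "__main__":
    # Figures from the text: ~37 pods, 8 vCPUs per node, 10-core quota.
    for max_pods in (30, 60):
        needed = vcpus_required(37, max_pods, 8)
        print(f"max_pods={max_pods}: {needed} vCPUs needed, "
              f"fits 10-core quota: {fits_quota(37, max_pods, 8, 10)}")
```

With max_pods = 30 the install needs two nodes (16 vCPUs), which exceeds a 10-core quota; with max_pods = 60 a single 8-vCPU node is enough.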
Recommended quota for multi-dataplane (3 dataplanes): 32 cores.
Request a quota increase:
Standard_DS4_v2 (default) + Standard_DS5_v2 (large). Same vCPU, slightly less RAM. Validated for the full LangSmith install plus all add-ons.
max_pods is immutable on an existing node pool. Set it before the first terraform apply.
Istio addon revision not supported
Symptom: terraform apply rejects the Istio revision (Revision asm-1-XX is not supported). Azure retires old ASM revisions regularly.
Fix: Check currently available revisions and update istio_addon_revision:
terraform.tfvars and re-apply.
Key Vault purge protection cannot be disabled after enabling
Symptom
After terraform destroy, Azure soft-deletes the Key Vault for 90 days. The next terraform apply with the same name silently recovers the old Key Vault, including its original purge_protection_enabled = true. Purge protection is one-way (enabled → cannot be disabled).
Fix — accept purge protection (test environments):
purge_protection = false:
Key Vault secrets already exist but are not in Terraform state
Symptom
Older setup-env.sh versions wrote Fernet keys directly to Key Vault. Current setup-env.sh is read-only against Key Vault; Terraform is the sole writer.
Fix: Import the conflicting secrets:
Application stage
dns_label subdomain not resolving — TLS cert stuck pending
Symptom: nslookup langsmith-demo.eastus.cloudapp.azure.com returns NXDOMAIN. The cert-manager ACME challenge cannot complete; TLS certificate stays READY: False.
Cause: The service.beta.kubernetes.io/azure-dns-label-name annotation must be set on the NGINX LoadBalancer service so Azure assigns the DNS label to the public IP. make deploy sets it automatically via deploy.sh. If you ran helm upgrade directly, the annotation was never set.
Fix
istio-addon — port 80/443 timeout, TLS handshake reset
Symptom: Site unreachable after make deploy with ingress_controller = "istio-addon". Port 80 times out, port 443 resets. ACME challenge stays pending.
Cause — three compounding issues:
- Wrong gateway label. Kubernetes Ingress with ingressClassName: istio targets pods with label istio: ingressgateway. The AKS managed external gateway uses istio: aks-istio-ingressgateway-external.
- ClusterIssuer created with class: nginx. The ACME HTTP-01 solver ingress gets class nginx, not istio.
- TLS secret in wrong namespace. Istio SDS reads from the gateway pod namespace (aks-istio-ingress), not the app namespace (langsmith).
make deploy handles all three automatically in the current scripts. If deploying manually, create an Istio Gateway targeting istio: aks-istio-ingressgateway-external, patch the ClusterIssuer solver to ingressClassName: istio, sync langsmith-tls to the aks-istio-ingress namespace, and create a VirtualService routing to the LangSmith frontend. See the TROUBLESHOOTING.md source for the full YAML.
letsencrypt-prod ClusterIssuer missing
Symptom: kubectl describe certificate langsmith-tls -n langsmith shows clusterissuers.cert-manager.io "letsencrypt-prod" not found.
Cause: Older versions of the k8s-bootstrap module did not create the ClusterIssuer. Current versions do; make deploy also applies it via kubectl apply (since kubernetes_manifest requires a live cluster API during plan).
Fix — apply manually:
database "langsmith" does not exist — backend pods crashlooping
Symptom: Backend pods crash immediately: FATAL: database "langsmith" does not exist.
Cause: Azure DB for PostgreSQL Flexible Server does not auto-create application databases. The Terraform postgres module now creates the database via azurerm_postgresql_flexible_server_database. This error means you are on an older module version missing that resource.
Fix
langsmith-backend-auth-bootstrap stuck in CreateContainerConfigError
Cause: The Job reads the admin password using key initial_org_admin_password. If the Secret was created with a different key name (for example admin_password), the container cannot start.
Fix
Cannot roll back to an older chart version
Cause: LangSmith DB migrations are forward-only. Downgrading the chart leaves the DB at a revision the older app image cannot locate.
Fix: Roll forward to the version you were on (or newer). Set langsmith_helm_chart_version in terraform.tfvars and re-deploy. Always test new chart versions in a separate environment before upgrading production.
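Because migrations are forward-only, it can help to guard against an accidental downgrade before changing langsmith_helm_chart_version. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the version strings are made-up examples, and it assumes plain numeric dotted versions (no pre-release suffixes).

```python
def parse_chart_version(version: str) -> tuple[int, ...]:
    """Turn '0.10.5' (or 'v0.10.5') into (0, 10, 5).

    Assumes purely numeric dotted versions; a suffix such as '-rc1'
    raises ValueError rather than being silently misread.
    """
    return tuple(int(part) for part in version.strip().lstrip("v").split("."))


def is_downgrade(deployed: str, target: str) -> bool:
    """True when the target chart version is older than the deployed one."""
    return parse_chart_version(target) < parse_chart_version(deployed)


if __name__ == "__main__":
    # Hypothetical version numbers, for illustration only.
    deployed, target = "0.10.5", "0.9.12"
    if is_downgrade(deployed, target):
        print(f"Refusing to deploy {target}: older than deployed {deployed}")
```

Comparing integer tuples rather than raw strings matters here: as strings, "0.9.12" sorts after "0.10.5", which would hide exactly the downgrade this check is meant to catch.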
Helm install times out
Cause: langsmith-backend-auth-bootstrap runs DB migrations on every helm upgrade; first install takes up to 5 minutes. Without --timeout 15m, Helm reports failure even though the install eventually succeeds.
Fix: make deploy already uses --timeout 20m. When running Helm manually, always include --timeout 20m.
Add-ons
Pods stay in DEPLOYING, never reach HEALTHY
Cause: config.deployment.url was empty or config.deployment.tlsEnabled was false when TLS is enabled. The operator builds agent endpoint URLs from these values.
Fix: init-values.sh automatically injects url and tlsEnabled after copying from examples. If deploying manually:
Insights add-on: backend-ch-migrations in CreateContainerConfigError
Symptom: Multiple pods fail with CreateContainerConfigError after enabling enable_insights = true. Logs: secret "langsmith-clickhouse" not found.
Cause: The example langsmith-values-insights.yaml sets clickhouse.external.enabled: true with existingSecretName: langsmith-clickhouse. This overrides the in-cluster ClickHouse configuration and expects an external secret that does not exist.
Fix: init-values.sh now generates a minimal Insights file when clickhouse_source = "in-cluster". For an existing deployment with this issue:
Polly shows “Unable to connect to LangGraph server”
Symptom: Polly chat widget shows connection error. Browser console: POST http://localhost:8123/threads net::ERR_FAILED and CORS error.
Cause A: The frontend started before langsmith-polly-config was created. The bootstrap job creates the ConfigMap with VITE_POLLY_DEPLOYMENT_URL after Polly is registered. Env vars from a ConfigMap load at pod start, not dynamically.
Fix
Cause B: LANGCHAIN_ENDPOINT is set in polly.agent.extraEnv. LANGCHAIN_ENDPOINT is reserved. Setting it causes the bootstrap job to fail with 400 Bad Request: 'LANGCHAIN_ENDPOINT' is reserved. Polly is never created.
Fix: Remove the polly.agent.extraEnv block entirely. The operator injects LANGCHAIN_ENDPOINT automatically.
listener and operator pods never appear after enabling LangSmith Deployment
Cause: config.deployment.url was set but config.deployment.enabled: true was omitted. The chart silently skips creating listener and operator when enabled is false (the default).
Fix: Add enabled: true inside the config.deployment block, alongside url.
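Both this issue and the empty url / tlsEnabled issue above come down to three keys under config.deployment, so they can be checked together before deploying. The sketch below is a hypothetical helper, not part of the deployment scripts: it assumes the values have already been parsed into a dictionary (for example from the JSON output of helm get values), and the example URL is a placeholder.

```python
from typing import Any, Mapping


def check_deployment_config(values: Mapping[str, Any], tls_expected: bool) -> list[str]:
    """Return human-readable problems found under config.deployment."""
    deployment = (values.get("config") or {}).get("deployment") or {}
    problems = []
    if deployment.get("enabled") is not True:
        problems.append(
            "config.deployment.enabled is not true; "
            "listener and operator will not be created"
        )
    if not deployment.get("url"):
        problems.append("config.deployment.url is empty")
    if tls_expected and deployment.get("tlsEnabled") is not True:
        problems.append("config.deployment.tlsEnabled is not true while TLS is enabled")
    return problems


if __name__ == "__main__":
    # Placeholder values: url is set, but enabled and tlsEnabled were omitted.
    values = {"config": {"deployment": {"url": "https://langsmith.example.com"}}}
    for problem in check_deployment_config(values, tls_expected=True):
        print(problem)
```

An empty result means the three keys are consistent; it does not prove that the operator can actually reach the URL.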
Duplicate top-level config: key silently drops values
Cause: YAML disallows duplicate top-level keys. A second config: block silently drops one of them.
Fix: Always add new config blocks inside the existing config: key. Verify with helm get values langsmith -n langsmith.
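A duplicated top-level key is easy to miss in a long overrides file. As a rough pre-flight check, the sketch below scans the raw text for keys that start in column 0 and appear more than once. It is a hypothetical helper that assumes a single-document YAML file with conventional formatting; it is not a YAML parser and does not inspect nested keys.

```python
import re
from collections import Counter

# A top-level key starts in column 0 and is followed by a colon.
TOP_LEVEL_KEY = re.compile(r"^([A-Za-z_][\w.-]*)\s*:")


def duplicate_top_level_keys(yaml_text: str) -> list[str]:
    """Return top-level keys that occur more than once, sorted by name."""
    counts = Counter()
    for line in yaml_text.splitlines():
        match = TOP_LEVEL_KEY.match(line)  # indented lines and comments never match
        if match:
            counts[match.group(1)] += 1
    return sorted(key for key, n in counts.items() if n > 1)


if __name__ == "__main__":
    sample = (
        "config:\n"
        "  deployment:\n"
        "    enabled: true\n"
        "ingress:\n"
        "  enabled: true\n"
        "config:\n"
        "  insights:\n"
        "    enabled: true\n"
    )
    print(duplicate_top_level_keys(sample))
```

For the sample above, the function reports config as duplicated; merging the second block into the first resolves it.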
Encryption keys must not change after first deploy
Changing deployments_encryption_key, agent_builder_encryption_key, or insights_encryption_key after their first use permanently corrupts the data they protect. There is no recovery.
- Do not rotate these keys.
- Do not set config.agentBuilder.encryptionKey or config.insights.encryptionKey inline in values-overrides.yaml. The chart reads them from langsmith-config-secret via existingSecretName. Setting inline overrides the secret reference.
agent-builder-tool-server or polly in CrashLoopBackOff
Symptom: Pod restarts indefinitely. No traceback. Logs show “Child process died” repeatedly.
Cause: lc_config.settings.SharedSettings is instantiated at module import time inside the uvicorn worker. A pydantic ValidationError raised there exits the worker with code 0; uvicorn’s parent prints “Child process died” but swallows the traceback. Common triggers: BASIC_AUTH_ENABLED = true but BASIC_AUTH_JWT_SECRET is empty, or a required feature-flag key absent from langsmith-config.
Diagnose by running the server in a debug pod with envFrom pointing at langsmith-config and PYTHONUNBUFFERED=1. Fix: add the missing key to Key Vault, rerun make k8s-secrets, restart the deployment.
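Since the worker swallows the traceback, one way to narrow things down in the debug pod is to check the environment for the triggers listed above before starting the server. The sketch below is illustrative only: the function name is hypothetical, the set of strings treated as "true" is an assumption, and the list of required keys is left to the caller because the source does not enumerate them.

```python
import os
from typing import Iterable, Mapping

# Assumption: these spellings count as "enabled"; adjust to match the real settings parser.
TRUTHY = {"1", "true", "yes"}


def find_config_problems(env: Mapping[str, str],
                         required_keys: Iterable[str] = ()) -> list[str]:
    """Report the misconfigurations named in the text for a given environment."""
    problems = []
    basic_auth_on = env.get("BASIC_AUTH_ENABLED", "").strip().lower() in TRUTHY
    if basic_auth_on and not env.get("BASIC_AUTH_JWT_SECRET", "").strip():
        problems.append("BASIC_AUTH_ENABLED is true but BASIC_AUTH_JWT_SECRET is empty")
    for key in required_keys:
        if not env.get(key, "").strip():
            problems.append(f"required key {key} is missing or empty")
    return problems


if __name__ == "__main__":
    # Run inside the debug pod so os.environ reflects langsmith-config.
    for problem in find_config_problems(os.environ):
        print(problem)
```

No output means none of the checked triggers is present; the underlying ValidationError may still come from a key this sketch does not know about.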
Workload Identity
Pod panics: AADSTS700213: No matching federated identity record found
Symptom
service_accounts_for_workload_identity in modules/k8s-cluster/main.tf:
Teardown and cleanup
make clean before make destroy orphans infrastructure
Symptom: make destroy after make clean fails with No state file was found!. Azure resources still run but Terraform has lost tracking.
Cause: make clean removes terraform.tfvars and secrets.auto.tfvars. Without them, Terraform cannot initialize the backend.
Correct teardown order
identifier afterwards, Azure may recover the soft-deleted Key Vault on the next terraform apply. With keyvault_purge_protection = false, purge first: az keyvault purge --name langsmith-kv<identifier> --location <region>.
terraform destroy stalls on VNet/subnet deletion
Cause: The Azure Load Balancer provisioned by ingress-nginx-controller is not tracked by Terraform. Azure blocks VNet deletion while the LB holds a subnet reference.
Fix: Run make uninstall first.
langsmith-agent-bootstrap hook times out
Symptom: Helm post-upgrade hook times out (context deadline exceeded). Agents progress through QUEUED → AWAITING_DEPLOY → DEPLOYING but do not reach HEALTHY in 20 minutes.
Cause: On a cold cluster, three LGP agents (agent-builder, clio, smith-polly) can take longer than 20 minutes for first image pulls. The Helm hook waits synchronously.
Fix: Not actually a failure. Resources are applied; agents continue deploying. Wait until pods stabilize, then re-run make deploy.
listener pods OOMKilled
Cause: langsmith-values-sizing-dev.yaml sets listener.deployment.resources.limits.memory: 512Mi. With Deployments enabled, the listener exceeds this.
Fix: The langsmith-values-agent-deploys.yaml overlay correctly sets 4Gi. Re-run make init-values to regenerate overlays.
The chart uses listener.deployment.resources for container limits, not listener.resources. Setting listener.resources in an overlay is silently ignored.
Stale HPA scales listener or host-backend to max replicas
Cause: A prior Helm revision created an HPA. Helm does not clean it up on failed hooks. On re-deploy with enabled: false, the stale HPA remains and overrides replicas.
Fix
AGIC (Application Gateway Ingress Controller)
AGIC pod CrashLoopBackOff — 403 on AGW GET
Symptom: ingress-appgw-deployment is in CrashLoopBackOff. Logs: ErrorApplicationGatewayForbidden: does not have authorization to perform action Microsoft.Network/applicationGateways/read.
Cause: AKS creates a managed identity for the AGIC add-on (ingressapplicationgateway-<cluster> in the MC_ resource group). The identity is created during cluster provisioning but takes ~5 minutes to register in Azure AD before role assignments take effect.
Fix: The k8s-cluster module waits 300s after cluster creation (time_sleep.agic_identity_propagation) and creates the three required role assignments automatically. If AGIC is still 403 after make apply:
AGIC — ApplicationGatewayInsufficientPermissionOnSubnet
Cause: AGIC add-on identity missing Network Contributor on the VNet.
Fix
AGIC — SecretNotFound for TLS secret
Cause: AGIC saw the Ingress before cert-manager issued the TLS certificate.
Fix: Touch the Ingress to trigger re-sync:
AGIC — ingressClassName: azure/application-gateway rejected
Cause: The legacy annotation kubernetes.io/ingress.class: azure/application-gateway (with slash) is not a valid ingressClassName. AKS creates the IngressClass as azure-application-gateway (hyphen).
Fix: Use ingressClassName: azure-application-gateway. make init-values sets this automatically.
Istio (self-managed Helm)
Istio site returns connection refused / no routes
Symptom: Connection refused. pilot-agent request GET config_dump shows LDS: PUSH resources:0.
Root causes (all three must be fixed):
- meshConfig.ingressControllerMode not set. Default is DEFAULT, which ignores ingressClassName. Must be STRICT.
- istio IngressClass resource missing.
- meshConfig.ingressClass not set to istio.
meshConfig is set in the istiod Helm release (Terraform); deploy.sh creates the IngressClass. Manual fix: create the IngressClass and restart istiod.
Istio HTTPS returns “no peer certificate available”
Cause: istiod reads the TLS secret via SDS (kubernetes://langsmith-tls). The secret must exist in istio-system (the gateway pod namespace). cert-manager issues it to the langsmith namespace; it is not copied automatically.
Fix: deploy.sh syncs the secret post-deploy. Manual fix: copy the secret to istio-system.
Leftover CRDs from istio-addon block self-managed Helm install
Symptom: terraform apply fails: CustomResourceDefinition "wasmplugins.extensions.istio.io" exists and cannot be imported into the current release: invalid ownership metadata.
Fix
Diagnostic commands
Cluster access
Pods
Ingress and TLS
Workload Identity
Helm
LangSmith Deployment
Key Vault and Kubernetes Secrets
Quick health check

