This page documents common issues, fixes, and diagnostic commands for LangSmith deployments provisioned with the GCP Terraform modules.
Before upgrading, review the LangSmith self-hosted changelog for breaking changes and required variable updates.
Before running any kubectl commands, fetch cluster credentials with gcloud container clusters get-credentials <cluster-name> --region <region> --project <project-id>.
Automated diagnostics
Before running individual commands, try the bundled scripts:
# Full deployment health check + next-step guidance
make status
# Secret Manager validation
make secrets # → manage-secrets.sh validate
Known issues
Required GCP APIs not enabled (Error 403)
Symptom
Error 403: ... has not been used in project <project-id> before or it is disabled.
Cause: Required GCP APIs are not enabled. Terraform enables them via google_project_service, but cloudresourcemanager.googleapis.com must already be enabled for Terraform to enable the others.
Fix
gcloud services enable cloudresourcemanager.googleapis.com --project <project-id>
cd modules/gcp/infra
terraform apply -var-file=terraform.tfvars
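Before re-applying, it can help to confirm that the prerequisite API is actually enabled. The sketch below wraps gcloud services list in a small check. The function name api_enabled is hypothetical and not part of the Terraform modules.

```shell
# Sketch: succeed only if the given API appears in the project's enabled services.
# api_enabled is a hypothetical helper name.
api_enabled() {
  local api="$1" project="$2"
  gcloud services list --enabled \
    --project "$project" \
    --filter="config.name:${api}" \
    --format="value(config.name)" | grep -qxF "$api"
}

# Usage (replace the placeholder with your project ID):
#   api_enabled cloudresourcemanager.googleapis.com <project-id> \
#     || echo "cloudresourcemanager.googleapis.com is not enabled"
```

The exact-line match with grep -qxF guards against the filter matching a longer service name that merely contains the requested one.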
GKE cluster API server not accessible after apply
Symptom
Error: Get "https://<cluster-endpoint>/api/v1/namespaces": dial tcp: connection refused
Cause: The GKE control plane takes 10 to 15 minutes to become fully operational. Terraform waits for the cluster to reach RUNNING, then adds a 90-second buffer. On slow projects, cold-start API activation can exceed that window.
Fix: Wait until the cluster status is RUNNING, then re-run the apply:
gcloud container clusters describe <cluster-name> \
--region <region> --project <project-id> --format="value(status)"
terraform apply -var-file=terraform.tfvars
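Rather than re-running the describe command by hand, the wait can be scripted. This is a minimal sketch: the function name wait_for_cluster_running and the default budget of 30 attempts at 30-second intervals are assumptions, chosen to cover the 10 to 15 minute startup described above.

```shell
# Sketch: poll the cluster status until it reports RUNNING or attempts run out.
# wait_for_cluster_running and its default attempts/interval are hypothetical.
wait_for_cluster_running() {
  local cluster="$1" region="$2" project="$3"
  local attempts="${4:-30}" interval="${5:-30}"
  local status i=1
  while [ "$i" -le "$attempts" ]; do
    status=$(gcloud container clusters describe "$cluster" \
      --region "$region" --project "$project" --format="value(status)")
    if [ "$status" = "RUNNING" ]; then
      echo "cluster $cluster is RUNNING"
      return 0
    fi
    echo "attempt $i/$attempts: status=${status:-unknown}, retrying in ${interval}s"
    sleep "$interval"
    i=$((i + 1))
  done
  echo "cluster $cluster did not reach RUNNING" >&2
  return 1
}

# Usage:
#   wait_for_cluster_running <cluster-name> <region> <project-id> \
#     && terraform apply -var-file=terraform.tfvars
```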
GKE nodes not joining (NotReady)
Symptom: kubectl get nodes shows no nodes or nodes stuck in NotReady.
Cause: Node pool service account lacks roles/container.nodeServiceAccount, or VPC firewall rules block node-to-control-plane communication.
Fix
gcloud container node-pools describe <pool-name> \
--cluster <cluster-name> --region <region> \
--format="value(config.serviceAccount)"
gcloud projects add-iam-policy-binding <project-id> \
--member="serviceAccount:<node-sa-email>" \
--role="roles/container.nodeServiceAccount"
gcloud compute firewall-rules list --filter="network:<vpc-name>"
Cloud SQL connection refused from GKE pods
Symptom: Backend logs show connection refused or no route to host for the Cloud SQL private IP.
Cause: The private service connection (VPC peering) is not established, or the allocated IP range is too small. Often happens when servicenetworking.googleapis.com was not enabled before the networking module ran.
Fix
gcloud services vpc-peerings list --network <vpc-name> --project <project-id>
gcloud sql instances describe <instance-name> --format="value(ipAddresses)"
gcloud compute networks peerings list --network <vpc-name>
If peering is missing, ensure enable_private_service_connection = true and re-apply:
terraform apply -var-file=terraform.tfvars -target=module.networking
terraform apply -var-file=terraform.tfvars
Memorystore Redis connection timeout
Symptom: Pods cannot connect to Redis. Logs show dial tcp: connection timed out or redis: connection refused.
Cause: The Memorystore authorized_network does not match the GKE VPC, or the Redis private IP is on a range not routable from the GKE subnet.
Fix
gcloud redis instances describe <instance-name> --region <region> \
--format="value(host,authorizedNetwork)"
kubectl run redis-test --rm -it --image=redis:7 -n langsmith -- \
redis-cli -h <redis-private-ip> ping
# Expected: PONG
cert-manager fails to issue Let’s Encrypt certificate
Symptom: kubectl get certificate -n langsmith shows READY=False, and the HTTP01 challenge is failing.
Cause: The DNS A record does not point to the Envoy Gateway IP, or port 80 is blocked on the load balancer.
Fix
kubectl get svc -n envoy-gateway-system \
-l gateway.envoyproxy.io/owning-gateway-name=langsmith-gateway \
-o jsonpath='{.items[0].status.loadBalancer.ingress[0].ip}'
kubectl describe certificate <cert-name> -n langsmith
kubectl get challenges -n langsmith
kubectl describe challenge -n langsmith
dig +short <your-langsmith-domain>
The DNS A record must resolve to the Gateway IP before the certificate can be issued. cert-manager’s HTTP01 solver needs port 80 to be reachable from the internet.
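The DNS requirement above can be checked in one step by comparing the dig output against the Gateway service IP. The following is a sketch only: dns_matches_gateway is a hypothetical helper name, and it assumes the Gateway is named langsmith-gateway as in the commands above.

```shell
# Sketch: verify that a domain's A record resolves to the Envoy Gateway IP.
# dns_matches_gateway is a hypothetical helper name.
dns_matches_gateway() {
  local domain="$1"
  local gateway_ip
  gateway_ip=$(kubectl get svc -n envoy-gateway-system \
    -l gateway.envoyproxy.io/owning-gateway-name=langsmith-gateway \
    -o jsonpath='{.items[0].status.loadBalancer.ingress[0].ip}')
  if [ -z "$gateway_ip" ]; then
    echo "no external IP assigned to the Gateway service yet" >&2
    return 1
  fi
  if dig +short "$domain" | grep -qxF "$gateway_ip"; then
    echo "OK: $domain resolves to $gateway_ip"
  else
    echo "MISMATCH: $domain does not resolve to $gateway_ip" >&2
    return 1
  fi
}

# Usage:
#   dns_matches_gateway <your-langsmith-domain>
```

A match only confirms the A record. It does not prove that port 80 is reachable from the internet, which the HTTP01 solver also needs.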
GCS bucket access denied from LangSmith pods
Symptom: Backend logs show AccessDeniedException: 403 Insufficient Permission or 403 Forbidden when writing to GCS.
Cause: HMAC credentials passed to Helm are incorrect, or the service account that owns the HMAC key lacks roles/storage.objectAdmin on the bucket.
Fix
helm get values langsmith -n langsmith | grep bucketName
gsutil config -a # configure with your HMAC key
gsutil ls gs://<bucket-name>
gcloud storage buckets get-iam-policy gs://<bucket-name>
Create a new HMAC key in GCP Console under Storage → Settings → Interoperability. The key’s service account must have roles/storage.objectAdmin on the bucket.
Envoy Gateway webhook blocking GKE operations
Symptom
Error from server (InternalError): failed calling webhook "validate.gateway.envoyproxy.io"
Cause: The Envoy Gateway admission webhook is not ready or its caBundle is stale.
Fix
kubectl get pods -n envoy-gateway-system
kubectl rollout restart deployment/envoy-gateway -n envoy-gateway-system
kubectl rollout status deployment/envoy-gateway -n envoy-gateway-system
Envoy Gateway external IP changed after re-apply
Symptom: DNS no longer resolves to the correct IP after Terraform re-apply, or existing firewall allowlists stop working.
Cause: The Envoy Gateway external IP is tied to the Gateway Kubernetes resource managed by Terraform. If the resource is deleted and recreated (terraform taint, a module change that forces replacement, or terraform destroy + re-apply), GCP issues a new IP. There is no way to reserve the original IP without pre-allocating a static regional address.
Prevention
- Do not terraform taint or manually delete the Gateway resource.
- Use make destroy + make apply only for full teardown and rebuild.
- Before any operation that might recreate the Gateway, note the current IP.
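One way to note the current IP is to save it to a file before the operation and compare afterwards. This is a sketch under stated assumptions: the function names and the default file name gateway-ip.txt are hypothetical, and the jsonpath assumes a single Gateway in the langsmith namespace.

```shell
# Sketch: record the Gateway IP before a risky operation, then detect a change.
# Function names and the default file path are hypothetical.
save_gateway_ip() {
  local file="${1:-gateway-ip.txt}"
  kubectl get gateway -n langsmith \
    -o jsonpath='{.items[0].status.addresses[0].value}' > "$file"
  echo "saved $(cat "$file") to $file"
}

# Returns 0 (and prints both addresses) when the IP differs from the saved one.
gateway_ip_changed() {
  local file="${1:-gateway-ip.txt}"
  local current
  current=$(kubectl get gateway -n langsmith \
    -o jsonpath='{.items[0].status.addresses[0].value}')
  if [ "$current" != "$(cat "$file")" ]; then
    echo "Gateway IP changed: $(cat "$file") -> $current"
    return 0
  fi
  return 1
}

# Usage:
#   save_gateway_ip            # before terraform apply
#   gateway_ip_changed         # after; exit status 0 means DNS needs updating
```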
Recovery: Update your DNS A record to the new IP:
kubectl get gateway -n langsmith -o jsonpath='{.items[0].status.addresses[0].value}'
gcloud dns record-sets update <your-domain>. \
--type=A --ttl=300 \
--rrdatas=<new-ip> \
--zone=<zone-name> \
--project=<project-id>
terraform destroy blocked by deletion protection
Symptom
Error: googleapi: Error 409: The instance is protected from deletion.
Cause: gke_deletion_protection = true (default) or postgres_deletion_protection = true prevents Terraform from destroying the resources.
Fix
# terraform.tfvars
gke_deletion_protection = false
postgres_deletion_protection = false
terraform apply -var-file=terraform.tfvars
terraform destroy
Workload Identity not working (GCS permission denied)
Symptom
AccessDeniedException: 403 <pod-sa>@<project>.iam.gserviceaccount.com
does not have storage.objects.create access to the Google Cloud Storage bucket.
Cause: The Kubernetes ServiceAccount used by LangSmith pods is missing the Workload Identity annotation, or the GCP SA is missing the GCS IAM binding.
Diagnosis
kubectl get serviceaccount langsmith-backend -n langsmith \
-o jsonpath='{.metadata.annotations}' | python3 -m json.tool
BUCKET=$(terraform -chdir=infra output -raw storage_bucket_name)
gsutil iam get gs://$BUCKET | grep -A3 "serviceAccount"
GSA=$(terraform -chdir=infra output -raw workload_identity_service_account_email)
gcloud projects get-iam-policy <project-id> \
--flatten="bindings[].members" --filter="bindings.members:$GSA"
Fix
terraform -chdir=infra apply -target=module.iam
make init-values
make deploy
langsmith-ksa missing Workload Identity annotation
Symptom: Operator-spawned agent pods fail to start or get stuck in Pending. Logs show permission errors or the agent bootstrap job hangs.
Cause: langsmith-ksa is created by the LangSmith operator (not Helm) and does not survive namespace teardowns or fresh cluster rebuilds. deploy.sh re-annotates it post-deploy; if a previous deploy was interrupted, the annotation may be missing.
Diagnosis
kubectl get serviceaccount langsmith-ksa -n langsmith \
-o jsonpath='{.metadata.annotations.iam\.gke\.io/gcp-service-account}'
Fix
# Re-run deploy — idempotently creates and annotates langsmith-ksa
make deploy
# Or annotate manually
WI=$(terraform -chdir=infra output -raw workload_identity_annotation)
kubectl create serviceaccount langsmith-ksa -n langsmith --dry-run=client -o yaml \
| kubectl apply -f -
kubectl annotate serviceaccount langsmith-ksa -n langsmith \
iam.gke.io/gcp-service-account="$WI" --overwrite
Helm release stuck in pending-upgrade
Symptom
Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress
Cause: A previous helm upgrade was interrupted (for example, Ctrl+C during --wait), and Helm left the release locked.
Fix: deploy.sh detects and auto-recovers this state. If running manually:
helm rollback langsmith -n langsmith --wait --timeout 5m
make deploy
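When running manually, it may be safer to roll back only if the release is actually in a pending state. The sketch below uses helm list --pending for that check. It is not necessarily how deploy.sh detects the condition, and recover_pending_release is a hypothetical helper name.

```shell
# Sketch: roll back only when the release is listed as pending.
# recover_pending_release is a hypothetical helper name.
recover_pending_release() {
  local release="${1:-langsmith}" namespace="${2:-langsmith}"
  if helm list -n "$namespace" --pending -q | grep -qxF "$release"; then
    echo "release $release is pending; rolling back"
    helm rollback "$release" -n "$namespace" --wait --timeout 5m
  else
    echo "release $release is not pending; nothing to do"
  fi
}

# Usage:
#   recover_pending_release && make deploy
```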
Secret Manager access denied
Symptom
ERROR: PERMISSION_DENIED: Permission 'secretmanager.versions.access'
denied on resource 'projects/.../secrets/...'
Cause: Either secretmanager.googleapis.com is not enabled, or the operator account lacks roles/secretmanager.admin.
Fix
gcloud services enable secretmanager.googleapis.com --project <project-id>
gcloud projects add-iam-policy-binding <project-id> \
--member="user:$(gcloud config get account)" \
--role="roles/secretmanager.admin"
langsmith-postgres or langsmith-redis Secret missing
Symptom: Pods crash with database connection errors immediately after deploy, or kubectl get secrets -n langsmith does not list langsmith-postgres / langsmith-redis.
Cause: The k8s-bootstrap module creates these Secrets. They are absent if terraform apply was not run, failed partway through, or the namespace was deleted out-of-band.
Fix
terraform -chdir=infra apply -target=module.k8s_bootstrap
kubectl get secret langsmith-postgres -n langsmith
kubectl get secret langsmith-redis -n langsmith
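The two verification commands can be folded into a single check that reports which Secret is absent. This is a minimal sketch, and check_bootstrap_secrets is a hypothetical helper name.

```shell
# Sketch: report presence of both bootstrap Secrets; non-zero exit if any is missing.
# check_bootstrap_secrets is a hypothetical helper name.
check_bootstrap_secrets() {
  local namespace="${1:-langsmith}"
  local missing=0 name
  for name in langsmith-postgres langsmith-redis; do
    if kubectl get secret "$name" -n "$namespace" >/dev/null 2>&1; then
      echo "present: $name"
    else
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   check_bootstrap_secrets || terraform -chdir=infra apply -target=module.k8s_bootstrap
```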
Diagnostic commands
Cluster access
gcloud container clusters get-credentials <cluster-name> --region <region> --project <project-id>
kubectl config current-context
kubectl get nodes -o wide
Pods
kubectl get pods -n langsmith
kubectl get pods -n langsmith -w
kubectl describe pod <pod-name> -n langsmith
kubectl logs <pod-name> -n langsmith --tail=50
kubectl logs <pod-name> -n langsmith --previous --tail=50
kubectl logs -n langsmith deploy/langsmith-backend --tail=100 -f
TLS and certificates
kubectl get certificate -n langsmith
kubectl describe certificate <cert-name> -n langsmith
kubectl get challenges -n langsmith
kubectl get clusterissuer
Gateway and load balancer
kubectl get gateway -n langsmith
kubectl get httproute -n langsmith
kubectl get svc -n envoy-gateway-system -o wide
kubectl get pods -n envoy-gateway-system
Helm
helm status langsmith -n langsmith
helm history langsmith -n langsmith
helm get values langsmith -n langsmith
LangSmith Deployment
kubectl get pods -n langsmith | grep -E "host-backend|listener|operator"
kubectl get lgp -n langsmith
kubectl get crd | grep langchain
Workload Identity and IAM
kubectl get serviceaccount langsmith-backend -n langsmith \
-o jsonpath='{.metadata.annotations}' | python3 -m json.tool
kubectl get serviceaccount langsmith-ksa -n langsmith \
-o jsonpath='{.metadata.annotations.iam\.gke\.io/gcp-service-account}'
BUCKET=$(terraform -chdir=infra output -raw storage_bucket_name 2>/dev/null)
gsutil iam get gs://$BUCKET
gcloud iam service-accounts list --project <project-id> --filter="displayName:langsmith"
Secrets and bootstrap
kubectl get secrets -n langsmith
kubectl get secret langsmith-postgres -n langsmith
kubectl get secret langsmith-redis -n langsmith
kubectl get secret langsmith-postgres -n langsmith \
-o jsonpath='{.data.connection_url}' | base64 --decode
gcloud secrets list --project <project-id> --filter="name:langsmith"
gcloud secrets versions access latest \
--secret=langsmith-<prefix>-<env>-postgres-password \
--project <project-id>
make secrets
Quick health check
echo "=== Context ===" && kubectl config current-context
echo "=== Nodes ===" && kubectl get nodes
echo "=== Pods ===" && kubectl get pods -n langsmith
echo "=== Certificate ===" && kubectl get certificate -n langsmith
echo "=== Gateway ===" && kubectl get gateway -n langsmith
echo "=== Secrets ===" && kubectl get secrets -n langsmith | grep -E "langsmith-postgres|langsmith-redis"
echo "=== Helm ===" && helm status langsmith -n langsmith 2>/dev/null | grep -E "STATUS|LAST DEPLOYED"