This page documents common issues, fixes, and diagnostic commands for LangSmith deployments provisioned with the GCP Terraform modules.
Before upgrading, review the LangSmith self-hosted changelog for breaking changes and required variable updates.
Before running any kubectl commands, fetch cluster credentials with gcloud container clusters get-credentials <cluster-name> --region <region> --project <project-id>.

Automated diagnostics

Before running individual commands, try the bundled scripts:
# Full deployment health check + next-step guidance
make status

# Secret Manager validation
make secrets    # → manage-secrets.sh validate

Known issues

terraform apply fails: GCP APIs not enabled

Symptom
Error 403: ... has not been used in project <project-id> before or it is disabled.
Cause: Required GCP APIs are not enabled. Terraform enables them via google_project_service, but cloudresourcemanager.googleapis.com must already be enabled for Terraform to enable the others.
Fix
gcloud services enable cloudresourcemanager.googleapis.com --project <project-id>
cd modules/gcp/infra
terraform apply -var-file=terraform.tfvars
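Before re-running the apply, it can help to confirm that the API is actually listed as enabled. The following is a minimal sketch, not part of the Terraform modules: api_enabled is a hypothetical helper name, and it assumes that gcloud services list --enabled prints one service name per line when formatted with value(config.name).

```shell
# Hypothetical helper: succeeds when the given API appears in the project's enabled services.
# Arguments: API name (e.g. cloudresourcemanager.googleapis.com), project ID.
api_enabled() {
  local api="$1" project="$2"
  # Assumption: value(config.name) yields one service name per line.
  # grep -Fx matches the whole line literally, so dots are not treated as wildcards.
  gcloud services list --enabled --project "$project" \
    --format="value(config.name)" | grep -Fxq "$api"
}
```

For example, api_enabled cloudresourcemanager.googleapis.com "$PROJECT_ID" && terraform apply -var-file=terraform.tfvars runs the apply only once the check passes, where PROJECT_ID is a variable you set yourself.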

GKE cluster API server not accessible after apply

Symptom
Error: Get "https://<cluster-endpoint>/api/v1/namespaces": dial tcp: connection refused
Cause: The GKE control plane takes 10 to 15 minutes to become fully operational. Terraform waits for the cluster to report RUNNING, then adds a 90-second buffer, but cold-start API activation on slow projects can exceed that window.
Fix: Wait until the cluster status is RUNNING, then re-run the apply:
gcloud container clusters describe <cluster-name> \
  --region <region> --project <project-id> --format="value(status)"

terraform apply -var-file=terraform.tfvars
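Rather than checking the status by hand, the wait can be scripted. This is a rough sketch under stated assumptions: wait_for_cluster_running is a hypothetical helper (it is not one of the bundled scripts), and the default of 30 attempts at 30-second intervals is an arbitrary choice that roughly covers the 10 to 15 minute startup window.

```shell
# Hypothetical helper: poll the cluster until it reports RUNNING or attempts run out.
# Arguments: cluster name, region, project ID, max attempts (default 30), delay in seconds (default 30).
wait_for_cluster_running() {
  local cluster="$1" region="$2" project="$3"
  local attempts="${4:-30}" delay="${5:-30}"
  local i=1 status
  while [ "$i" -le "$attempts" ]; do
    # Treat a failed describe call the same as an unknown status and keep polling.
    status=$(gcloud container clusters describe "$cluster" \
      --region "$region" --project "$project" \
      --format="value(status)" 2>/dev/null) || status=""
    if [ "$status" = "RUNNING" ]; then
      echo "Cluster $cluster is RUNNING"
      return 0
    fi
    echo "Attempt $i/$attempts: status is '${status:-unknown}', retrying in ${delay}s"
    sleep "$delay"
    i=$((i + 1))
  done
  echo "Cluster $cluster did not reach RUNNING" >&2
  return 1
}
```

A possible invocation is wait_for_cluster_running "$CLUSTER" "$REGION" "$PROJECT_ID" && terraform apply -var-file=terraform.tfvars, with the three variables set to your own values.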

GKE nodes not joining (NotReady)

Symptom: kubectl get nodes shows no nodes or nodes stuck in NotReady.
Cause: Node pool service account lacks roles/container.nodeServiceAccount, or VPC firewall rules block node-to-control-plane communication.
Fix
gcloud container node-pools describe <pool-name> \
  --cluster <cluster-name> --region <region> \
  --format="value(config.serviceAccount)"

gcloud projects add-iam-policy-binding <project-id> \
  --member="serviceAccount:<node-sa-email>" \
  --role="roles/container.nodeServiceAccount"

gcloud compute firewall-rules list --filter="network:<vpc-name>"

Cloud SQL connection refused from GKE pods

Symptom: Backend logs show connection refused or no route to host for the Cloud SQL private IP.
Cause: The private service connection (VPC peering) is not established, or the allocated IP range is too small. This often happens when servicenetworking.googleapis.com was not enabled before the networking module ran.
Fix
gcloud services vpc-peerings list --network <vpc-name> --project <project-id>
gcloud sql instances describe <instance-name> --format="value(ipAddresses)"
gcloud compute networks peerings list --network <vpc-name>
If peering is missing, ensure enable_private_service_connection = true and re-apply:
terraform apply -var-file=terraform.tfvars -target=module.networking
terraform apply -var-file=terraform.tfvars

Memorystore Redis connection timeout

Symptom: Pods cannot connect to Redis. Logs show dial tcp: connection timed out or redis: connection refused.
Cause: The Memorystore authorized_network does not match the GKE VPC, or the Redis private IP is on a range not routable from the GKE subnet.
Fix
gcloud redis instances describe <instance-name> --region <region> \
  --format="value(host,authorizedNetwork)"

kubectl run redis-test --rm -it --image=redis:7 -n langsmith -- \
  redis-cli -h <redis-private-ip> ping
# Expected: PONG

cert-manager fails to issue Let’s Encrypt certificate

Symptom: kubectl get certificate -n langsmith shows READY=False, and the HTTP01 challenge is failing.
Cause: The DNS A record does not point to the Envoy Gateway IP, or port 80 is blocked on the load balancer.
Fix
kubectl get svc -n envoy-gateway-system \
  -l gateway.envoyproxy.io/owning-gateway-name=langsmith-gateway \
  -o jsonpath='{.items[0].status.loadBalancer.ingress[0].ip}'

kubectl describe certificate <cert-name> -n langsmith
kubectl get challenges -n langsmith
kubectl describe challenge -n langsmith

dig +short <your-langsmith-domain>
The DNS A record must resolve to the Gateway IP before the certificate can be issued. cert-manager’s HTTP01 solver needs port 80 to be reachable from the internet.
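The comparison between the DNS answer and the Gateway IP can be automated. The sketch below is illustrative only: dns_points_to_gateway is a hypothetical helper name, and it assumes dig is installed and that the resolver you query already sees the updated record.

```shell
# Hypothetical helper: succeeds when one of the domain's A records equals the Gateway IP.
# Arguments: domain name, expected Gateway IP.
dns_points_to_gateway() {
  local domain="$1" gateway_ip="$2"
  # dig +short may print several A records, one per line.
  # grep -Fx requires an exact whole-line match, so 203.0.113.1 does not match 203.0.113.10.
  dig +short "$domain" A | grep -Fxq "$gateway_ip"
}
```

With the Gateway IP from the kubectl get svc command stored in a variable such as GATEWAY_IP, dns_points_to_gateway <your-langsmith-domain> "$GATEWAY_IP" returns success only when DNS already points at the Gateway, which is the precondition for the HTTP01 challenge to pass.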

GCS bucket access denied from LangSmith pods

Symptom: Backend logs show AccessDeniedException: 403 Insufficient Permission or 403 Forbidden when writing to GCS.
Cause: HMAC credentials passed to Helm are incorrect, or the service account that owns the HMAC key lacks roles/storage.objectAdmin on the bucket.
Fix
helm get values langsmith -n langsmith | grep bucketName

gsutil config -a   # configure with your HMAC key
gsutil ls gs://<bucket-name>

gcloud storage buckets get-iam-policy gs://<bucket-name>
Create a new HMAC key in GCP Console under Storage → Settings → Interoperability. The key’s service account must have roles/storage.objectAdmin on the bucket.

Envoy Gateway webhook blocking GKE operations

Symptom
Error from server (InternalError): failed calling webhook "validate.gateway.envoyproxy.io"
Cause: The Envoy Gateway admission webhook is not ready or its caBundle is stale.
Fix
kubectl get pods -n envoy-gateway-system

kubectl rollout restart deployment/envoy-gateway -n envoy-gateway-system
kubectl rollout status deployment/envoy-gateway -n envoy-gateway-system

Envoy Gateway external IP changed after re-apply

Symptom: DNS no longer resolves to the correct IP after Terraform re-apply, or existing firewall allowlists stop working.
Cause: The Envoy Gateway external IP is tied to the Gateway Kubernetes resource managed by Terraform. If the resource is deleted and recreated (terraform taint, a module change that forces replacement, or terraform destroy + re-apply), GCP issues a new IP. There is no way to reserve the original IP without pre-allocating a static regional address.
Prevention
  • Do not terraform taint or manually delete the Gateway resource.
  • Use make destroy + make apply only for full teardown and rebuild.
  • Before any operation that might recreate the Gateway, note the current IP.
Recovery: Update your DNS A record to the new IP:
kubectl get gateway -n langsmith -o jsonpath='{.items[0].status.addresses[0].value}'

gcloud dns record-sets update <your-domain>. \
  --type=A --ttl=300 \
  --rrdatas=<new-ip> \
  --zone=<zone-name> \
  --project=<project-id>

terraform destroy fails: deletion protection enabled

Symptom
Error: googleapi: Error 409: The instance is protected from deletion.
Cause: gke_deletion_protection = true (default) or postgres_deletion_protection = true prevents Terraform from destroying the protected resources.
Fix: Disable both protections in terraform.tfvars, apply the change, then destroy:
# terraform.tfvars
gke_deletion_protection      = false
postgres_deletion_protection = false

terraform apply -var-file=terraform.tfvars
terraform destroy

Workload Identity not working (GCS permission denied)

Symptom
AccessDeniedException: 403 <pod-sa>@<project>.iam.gserviceaccount.com
  does not have storage.objects.create access to the Google Cloud Storage bucket.
Cause: The Kubernetes ServiceAccount used by LangSmith pods is missing the Workload Identity annotation, or the GCP SA is missing the GCS IAM binding.
Diagnosis
kubectl get serviceaccount langsmith-backend -n langsmith \
  -o jsonpath='{.metadata.annotations}' | python3 -m json.tool

BUCKET=$(terraform -chdir=infra output -raw storage_bucket_name)
gsutil iam get gs://$BUCKET | grep -A3 "serviceAccount"

GSA=$(terraform -chdir=infra output -raw workload_identity_service_account_email)
gcloud projects get-iam-policy <project-id> \
  --flatten="bindings[].members" --filter="bindings.members:$GSA"
Fix
terraform -chdir=infra apply -target=module.iam
make init-values
make deploy

langsmith-ksa missing Workload Identity annotation

Symptom: Operator-spawned agent pods fail to start or get stuck in Pending. Logs show permission errors or the agent bootstrap job hangs.
Cause: langsmith-ksa is created by the LangSmith operator (not Helm) and does not survive namespace teardowns or fresh cluster rebuilds. deploy.sh re-annotates it post-deploy; if a previous deploy was interrupted, the annotation may be missing.
Diagnosis
kubectl get serviceaccount langsmith-ksa -n langsmith \
  -o jsonpath='{.metadata.annotations.iam\.gke\.io/gcp-service-account}'
Fix
# Re-run deploy — idempotently creates and annotates langsmith-ksa
make deploy

# Or annotate manually
WI=$(terraform -chdir=infra output -raw workload_identity_annotation)
kubectl create serviceaccount langsmith-ksa -n langsmith --dry-run=client -o yaml \
  | kubectl apply -f -
kubectl annotate serviceaccount langsmith-ksa -n langsmith \
  iam.gke.io/gcp-service-account="$WI" --overwrite
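After annotating, you may want to confirm that the value on the ServiceAccount matches what Terraform reports. The following is a small sketch, assuming the workload_identity_annotation output holds the exact annotation value; ksa_annotation_ok is a hypothetical helper, not part of deploy.sh.

```shell
# Hypothetical helper: succeeds when the ServiceAccount carries the expected
# iam.gke.io/gcp-service-account annotation.
# Arguments: ServiceAccount name, namespace, expected annotation value.
ksa_annotation_ok() {
  local ksa="$1" namespace="$2" expected="$3" actual
  # A missing ServiceAccount or annotation yields an empty string.
  actual=$(kubectl get serviceaccount "$ksa" -n "$namespace" \
    -o jsonpath='{.metadata.annotations.iam\.gke\.io/gcp-service-account}' \
    2>/dev/null) || actual=""
  # Refuse an empty expected value so that "missing equals missing" never passes.
  [ -n "$expected" ] && [ "$actual" = "$expected" ]
}
```

For example, ksa_annotation_ok langsmith-ksa langsmith "$WI" || echo "annotation missing or stale" reuses the WI variable set from the Terraform output.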

Helm release stuck in pending-upgrade

Symptom
Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress
Cause: A previous helm upgrade was interrupted (for example, Ctrl+C during --wait), and Helm left the release locked.
Fix: deploy.sh detects this state and recovers from it automatically. If running manually:
helm rollback langsmith -n langsmith --wait --timeout 5m
make deploy
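When recovering manually, it is safer to roll back only if the release is actually in a pending state. The sketch below illustrates one way to do that; it is not the logic used by deploy.sh. The function names are hypothetical, and it assumes helm status prints a line beginning with STATUS:. A release stuck in pending-install with no earlier revision has nothing to roll back to, and that case is not handled here.

```shell
# Hypothetical helper: print the release status, e.g. "deployed" or "pending-upgrade".
release_status() {
  local release="$1" namespace="$2"
  helm status "$release" -n "$namespace" 2>/dev/null \
    | awk '/^STATUS:/ {print $2; exit}'
}

# Hypothetical helper: roll back only when the release is in a pending-* state.
rollback_if_pending() {
  local release="$1" namespace="$2" status
  status=$(release_status "$release" "$namespace")
  case "$status" in
    pending-*)
      echo "Release $release is $status; rolling back"
      helm rollback "$release" -n "$namespace" --wait --timeout 5m
      ;;
    *)
      echo "Release $release is ${status:-not found}; no rollback needed"
      ;;
  esac
}
```

Running rollback_if_pending langsmith langsmith && make deploy then mirrors the manual fix while leaving a healthy release untouched.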

Secret Manager access denied

Symptom
ERROR: PERMISSION_DENIED: Permission 'secretmanager.versions.access'
  denied on resource 'projects/.../secrets/...'
Cause: Either secretmanager.googleapis.com is not enabled, or the operator account lacks roles/secretmanager.admin.
Fix
gcloud services enable secretmanager.googleapis.com --project <project-id>

gcloud projects add-iam-policy-binding <project-id> \
  --member="user:$(gcloud config get account)" \
  --role="roles/secretmanager.admin"

langsmith-postgres or langsmith-redis Secret missing

Symptom: Pods crash with database connection errors immediately after deploy, or kubectl get secrets -n langsmith does not list langsmith-postgres / langsmith-redis.
Cause: The k8s-bootstrap module creates these Secrets. They are absent if terraform apply was not run, failed partway through, or the namespace was deleted out-of-band.
Fix
terraform -chdir=infra apply -target=module.k8s_bootstrap

kubectl get secret langsmith-postgres -n langsmith
kubectl get secret langsmith-redis -n langsmith
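The two checks above can be folded into one loop that reports which Secrets are absent. This is a minimal sketch: missing_secrets is a hypothetical helper, and it treats any kubectl get failure (including an unreachable cluster) as a missing Secret.

```shell
# Hypothetical helper: print the names of required Secrets that are absent.
# Arguments: namespace, followed by one or more Secret names.
missing_secrets() {
  local namespace="$1" name
  shift
  for name in "$@"; do
    # kubectl get exits non-zero when the Secret does not exist.
    kubectl get secret "$name" -n "$namespace" >/dev/null 2>&1 || echo "$name"
  done
}
```

For instance, missing_secrets langsmith langsmith-postgres langsmith-redis prints nothing when both Secrets exist; any name it prints indicates that the k8s_bootstrap target should be re-applied.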

Diagnostic commands

Cluster access

gcloud container clusters get-credentials <cluster-name> --region <region> --project <project-id>
kubectl config current-context
kubectl get nodes -o wide

Pods

kubectl get pods -n langsmith
kubectl get pods -n langsmith -w
kubectl describe pod <pod-name> -n langsmith
kubectl logs <pod-name> -n langsmith --tail=50
kubectl logs <pod-name> -n langsmith --previous --tail=50
kubectl logs -n langsmith deploy/langsmith-backend --tail=100 -f

TLS and certificates

kubectl get certificate -n langsmith
kubectl describe certificate <cert-name> -n langsmith
kubectl get challenges -n langsmith
kubectl get clusterissuer

Gateway and load balancer

kubectl get gateway -n langsmith
kubectl get httproute -n langsmith
kubectl get svc -n envoy-gateway-system -o wide
kubectl get pods -n envoy-gateway-system

Helm

helm status langsmith -n langsmith
helm history langsmith -n langsmith
helm get values langsmith -n langsmith

LangSmith Deployment

kubectl get pods -n langsmith | grep -E "host-backend|listener|operator"
kubectl get lgp -n langsmith
kubectl get crd | grep langchain

Workload Identity and IAM

kubectl get serviceaccount langsmith-backend -n langsmith \
  -o jsonpath='{.metadata.annotations}' | python3 -m json.tool

kubectl get serviceaccount langsmith-ksa -n langsmith \
  -o jsonpath='{.metadata.annotations.iam\.gke\.io/gcp-service-account}'

BUCKET=$(terraform -chdir=infra output -raw storage_bucket_name 2>/dev/null)
gsutil iam get gs://$BUCKET

gcloud iam service-accounts list --project <project-id> --filter="displayName:langsmith"

Secrets and bootstrap

kubectl get secrets -n langsmith
kubectl get secret langsmith-postgres -n langsmith
kubectl get secret langsmith-redis -n langsmith

kubectl get secret langsmith-postgres -n langsmith \
  -o jsonpath='{.data.connection_url}' | base64 --decode

gcloud secrets list --project <project-id> --filter="name:langsmith"

gcloud secrets versions access latest \
  --secret=langsmith-<prefix>-<env>-postgres-password \
  --project <project-id>

make secrets

Quick health check

echo "=== Context ===" && kubectl config current-context
echo "=== Nodes ===" && kubectl get nodes
echo "=== Pods ===" && kubectl get pods -n langsmith
echo "=== Certificate ===" && kubectl get certificate -n langsmith
echo "=== Gateway ===" && kubectl get gateway -n langsmith
echo "=== Secrets ===" && kubectl get secrets -n langsmith | grep -E "langsmith-postgres|langsmith-redis"
echo "=== Helm ===" && helm status langsmith -n langsmith 2>/dev/null | grep -E "STATUS|LAST DEPLOYED"