# How to Troubleshoot

This page is for when something is broken or needs urgent maintenance. It covers how to diagnose problems, how to rotate credentials, how to undo a bad change, and common commands you'll need along the way.

If you're reading this during an incident, start at **Is the Server Down?** and work top to bottom.

***

## Is the Server Down?

Run this first:

```bash
curl -sI https://google.mcp.themailworks.com/health | head -1
```

If you get `HTTP/2 200`, the server is up and the problem is somewhere else (user's browser, Google's API, etc.).

If you get an error, follow the branch below that matches what you see.
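
The branch choice can also be scripted. The sketch below is an illustration only, not part of the deployed tooling: `triage_branch` is a hypothetical helper name, and it relies on curl's `%{http_code}` write-out variable, which prints `000` when no HTTP response was received.

```bash
# Hypothetical helper: map an HTTP status code to the branch of this page to follow.
# curl prints 000 for %{http_code} when it never received an HTTP response.
triage_branch() {
  case "$1" in
    200)         echo "up" ;;
    502|503|504) echo "Branch B" ;;
    000|"")      echo "Branch A" ;;
    *)           echo "unexpected status: $1" ;;
  esac
}

# Usage (needs network access):
#   code=$(curl -s -o /dev/null -w '%{http_code}' https://google.mcp.themailworks.com/health)
#   triage_branch "$code"
```

Anything other than 200, 502, 503, 504, or a failed connection is reported as unexpected, since this page has no branch for it.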

***

## Branch A: Can't Connect at All (connection refused, SSL error, NXDOMAIN)

This means the request isn't even reaching the server.

**A1: Is DNS working?**

```bash
dig +short google.mcp.themailworks.com
```

You should see one or two AWS IP addresses. If you see nothing (NXDOMAIN), the DNS record is missing. Check Route 53 and confirm the nameservers match `terraform output route53_nameservers`.

**A2: Is the TLS certificate valid?**

```bash
aws acm list-certificates --query 'CertificateSummaryList[?DomainName==`mcp.themailworks.com`]'
```

Status must be `ISSUED`. If it says `PENDING_VALIDATION`, the DNS validation records are missing from Route 53.

***

## Branch B: Getting 502, 503, or 504 Errors

This means the load balancer is reachable but the container behind it isn't responding.

**B1: Are there any healthy containers?**

```bash
TG_ARN=$(aws elbv2 describe-target-groups \
  --names prod-mcp-google-workspace-tg \
  --query 'TargetGroups[0].TargetGroupArn' --output text)

aws elbv2 describe-target-health --target-group-arn "$TG_ARN"
```

If the state is `unhealthy` or there are zero targets, the container is either not running or failing its health check. Move to B2.

**B2: What's the ECS service doing?**

```bash
aws ecs describe-services \
  --cluster prod-mcp-cluster \
  --services prod-mcp-google-workspace \
  --query 'services[0].{desired:desiredCount,running:runningCount,pending:pendingCount,events:events[:5]}'
```

Read the output:

* `running = 0`, `pending > 0`: The container is trying to start but failing. Go to B3.
* `running = 0`, `pending = 0`, `desired > 0`: AWS can't even start the container. Go to B4.
* `running > 0` but health check fails: The container is running but something inside it is broken. Go to B5.
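
The three cases above can be expressed as a small decision function. This is a sketch under assumptions: `next_step` is a hypothetical name, and the extra "scaled to zero" case is not covered by the list above.

```bash
# Hypothetical helper: pick the next step from the ECS service counts.
# Arguments: desired count, running count, pending count.
next_step() {
  local desired="$1" running="$2" pending="$3"
  if [ "$running" -eq 0 ] && [ "$pending" -gt 0 ]; then
    echo "B3"   # tasks are trying to start but failing
  elif [ "$running" -eq 0 ] && [ "$pending" -eq 0 ] && [ "$desired" -gt 0 ]; then
    echo "B4"   # AWS cannot start the task at all
  elif [ "$running" -gt 0 ]; then
    echo "B5"   # task is running; read the logs if the health check fails
  else
    echo "service is scaled to zero"
  fi
}

# Usage (needs AWS credentials):
#   read -r desired running pending < <(aws ecs describe-services \
#     --cluster prod-mcp-cluster --services prod-mcp-google-workspace \
#     --query 'services[0].[desiredCount,runningCount,pendingCount]' --output text)
#   next_step "$desired" "$running" "$pending"
```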

**B3: Why is the container failing to start?**

```bash
TASK=$(aws ecs list-tasks \
  --cluster prod-mcp-cluster \
  --family prod-mcp-google-workspace \
  --desired-status STOPPED \
  --query 'taskArns[0]' --output text)

aws ecs describe-tasks \
  --cluster prod-mcp-cluster \
  --tasks "$TASK" \
  --query 'tasks[0].{stopCode:stopCode,stoppedReason:stoppedReason,containers:containers[*].{name:name,lastStatus:lastStatus,exitCode:exitCode,reason:reason}}'
```

Common reasons:

| Stop Code or Reason                                | What It Means                                             | What to Do                                               |
| -------------------------------------------------- | --------------------------------------------------------- | -------------------------------------------------------- |
| `TaskFailedToStart`                                | AWS couldn't pull the container image or read the secrets | Check the ECR image URI and IAM role (B4)                |
| `EssentialContainerExited`                         | The container started but crashed                         | Read the logs (B5)                                       |
| `ResourceInitializationError` (in `stoppedReason`) | Couldn't fetch from Secrets Manager                       | Check the secret path and the execution role in `iam.tf` |

**B4: Can AWS pull the image and read secrets?**

```bash
# Does the container image exist?
aws ecr describe-images \
  --repository-name mcp/workspace \
  --image-ids imageTag=latest

# Do the secrets exist?
aws secretsmanager describe-secret --secret-id /prod/mcp/google-oauth
aws secretsmanager describe-secret --secret-id /prod/mcp/splunk
```

If a secret is missing or the execution role doesn't have permission, check the inline policy scope in `iam.tf` (should be `/prod/mcp/*`).

**B5: What do the container logs say?**

```bash
aws logs tail /ecs/prod/mcp/google_workspace --since 10m
```

Look for:

* `ResourceNotFoundException` or `AccessDeniedException`: A secrets or IAM permissions problem.
* `connection refused ... 6379`: Redis is unreachable. Go to Branch C.
* A Python stack trace: The application code crashed. Grab the log snippet and escalate to the team that manages the container code.
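
The patterns above can be checked in one pass. The following is a rough sketch: `classify_logs` and its output labels are hypothetical, and the stack-trace match assumes the standard Python `Traceback (most recent call last)` header appears in the logs.

```bash
# Hypothetical helper: read log text on stdin and print one label per pattern found.
classify_logs() {
  local logs
  logs="$(cat)"
  if grep -Eq 'ResourceNotFoundException|AccessDeniedException' <<<"$logs"; then
    echo "secrets-or-iam"
  fi
  if grep -Eiq 'connection refused.*6379' <<<"$logs"; then
    echo "redis"        # go to Branch C
  fi
  if grep -q 'Traceback (most recent call last)' <<<"$logs"; then
    echo "app-crash"    # escalate with the log snippet
  fi
}

# Usage (needs AWS credentials):
#   aws logs tail /ecs/prod/mcp/google_workspace --since 10m | classify_logs
```

No output means none of the three patterns matched, not that the logs are clean.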

***

## Branch C: Redis Problems

**C1: Is Redis running?**

```bash
aws elasticache describe-replication-groups \
  --replication-group-id prod-mcp-redis \
  --query 'ReplicationGroups[0].{status:Status,nodeGroups:NodeGroups[*].{status:Status,primaryEndpoint:PrimaryEndpoint.Address}}'
```

Status must be `available`. If it says `modifying` or `snapshotting`, wait. If `unavailable`, open an AWS Support case. As a temporary workaround, you can set `GOOGLE_MCP_CREDENTIAL_BACKEND=memory` in the container environment to bypass Redis entirely (but users will have to log in again every time the container restarts).

**C2: Can the container actually reach Redis?**

```bash
TASK=$(aws ecs list-tasks --cluster prod-mcp-cluster \
  --service-name prod-mcp-google-workspace \
  --query 'taskArns[0]' --output text)

aws ecs execute-command \
  --cluster prod-mcp-cluster --task "$TASK" \
  --container google_workspace --interactive \
  --command "/bin/sh -c 'nc -zv \$REDIS_HOST 6379; echo exit:\$?'"
```

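If the connection times out or is refused, check the `prod-mcp-redis-sg` security group. It must allow TCP 6379 from the `prod-mcp-ecs-tasks-sg` security group. A blocked security group usually shows up as a timeout rather than a refusal, because the packets are dropped silently.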

***

## Branch D: Some Requests Fail, but the Health Check Is Green

**D1: Are there 5xx errors on the load balancer?**

```bash
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name HTTPCode_ELB_5XX_Count \
  --dimensions Name=LoadBalancer,Value=$(aws elbv2 describe-load-balancers \
    --names prod-mcp-alb \
    --query 'LoadBalancers[0].LoadBalancerArn' --output text | sed 's|.*loadbalancer/||') \
  --start-time $(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 --statistics Sum
```
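
The command returns one datapoint per minute, which is tedious to read during an incident. A minimal sketch for totalling them, assuming you append `--query 'Datapoints[*].Sum' --output text` to the command above (`sum_values` is a hypothetical helper name):

```bash
# Hypothetical helper: add up whitespace-separated numbers read from stdin.
sum_values() {
  awk '{ for (i = 1; i <= NF; i++) total += $i } END { printf "%d\n", total }'
}

# Usage: append to the D1 command
#   ... --period 60 --statistics Sum \
#     --query 'Datapoints[*].Sum' --output text | sum_values
```

`HTTPCode_ELB_5XX_Count` counts errors generated by the load balancer itself. The same pipeline works with `HTTPCode_Target_5XX_Count`, which counts 5xx responses returned by the container.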

**D2: Are audit events missing from Splunk?** Run the Splunk verification query from the How It's Built page (Audit Logging section).

**D3: Are users seeing Google login errors?** Check logs for `GOOGLE_OAUTH` or `token` errors. Usually caused by a recent credential rotation that didn't trigger a container restart, or a stale session in Redis. Flush the affected user's session (see Rollback Redis State below) and ask them to log in again.

***

## When Nothing Above Fixes It

1. Pull the last 30 minutes of logs:

   ```bash
   aws logs tail /ecs/prod/mcp/google_workspace --since 30m > /tmp/mcp-logs.txt
   ```
2. Capture the service state:

   ```bash
   aws ecs describe-services --cluster prod-mcp-cluster \
     --services prod-mcp-google-workspace > /tmp/mcp-service.json
   ```
3. Open a Slack thread, attach both files, and tag the platform team.

***

## Rotating Credentials

Two sets of credentials exist. Both are stored in AWS Secrets Manager under `/prod/mcp/`. Terraform created them once and will never overwrite them. All rotation happens through the AWS CLI.

**The most important rule:** Always update the secret in Secrets Manager BEFORE restarting the container. The container reads secrets at startup. If you restart first, the container launches with the old (possibly invalid) credentials.

### Rotate Google OAuth Credentials

**Do this:** after a credential leak, or when GCP project ownership changes.

1. Open [Google Cloud Console](https://console.cloud.google.com), select the `CUSTOM-MCP-SERVER-MAILWORKS` project.
2. Go to APIs and Services > Credentials.
3. Click the OAuth 2.0 client ID, then "Reset Secret." Copy the new Client ID and Client Secret.
4. Update the secret:

   ```bash
   aws secretsmanager put-secret-value \
     --secret-id /prod/mcp/google-oauth \
     --secret-string '{
       "GOOGLE_OAUTH_CLIENT_ID":     "NEW_CLIENT_ID",
       "GOOGLE_OAUTH_CLIENT_SECRET": "NEW_CLIENT_SECRET"
     }'
   ```
5. Verify it saved correctly:

   ```bash
   aws secretsmanager get-secret-value \
     --secret-id /prod/mcp/google-oauth \
     --query 'SecretString' --output text | python3 -m json.tool
   ```
6. Restart the container:

   ```bash
   aws ecs update-service \
     --cluster prod-mcp-cluster \
     --service prod-mcp-google-workspace \
     --force-new-deployment
   ```
7. Watch the rollout:

   ```bash
   watch -n 10 "aws ecs describe-services \
     --cluster prod-mcp-cluster \
     --services prod-mcp-google-workspace \
     --query 'services[0].{running:runningCount,pending:pendingCount,deployments:deployments[*].{status:status,rollout:rolloutState}}'"
   ```

   Wait until you see one deployment with `status: PRIMARY` and `rollout: COMPLETED`.
8. Smoke test:

   ```bash
   curl -I https://google.mcp.themailworks.com/health
   ```
9. Go back to Google Cloud Console and disable or delete the old credentials.

**If something goes wrong:**

* Container won't start: Check `aws logs tail /ecs/prod/mcp/google_workspace --since 5m`. A `ResourceNotFoundException` means the secret path is wrong. An `InvalidRequestException` means the JSON is malformed.
* Container starts but login fails: You probably copied the credentials wrong. Repeat steps 1 through 6. The old containers are still draining during the rolling deploy, so there's no immediate outage.
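
The wait condition in step 7 (exactly one deployment, `PRIMARY` and `COMPLETED`) can be checked by a script instead of by eye. This is a sketch, not a tested part of the runbook: `rollout_done` is a hypothetical helper, and it assumes the deployments are fetched as text rows of `status` and `rolloutState`.

```bash
# Hypothetical helper: succeed only when the input is a single row "PRIMARY COMPLETED".
# Argument: rows of "<status> <rolloutState>", one deployment per line.
rollout_done() {
  local rows="$1" count
  count=$(grep -c . <<<"$rows")   # number of non-empty rows
  [ "$count" -eq 1 ] && grep -Eq '^PRIMARY[[:space:]]+COMPLETED$' <<<"$rows"
}

# Usage (needs AWS credentials):
#   until rollout_done "$(aws ecs describe-services \
#       --cluster prod-mcp-cluster --services prod-mcp-google-workspace \
#       --query 'services[0].deployments[*].[status,rolloutState]' --output text)"; do
#     sleep 10
#   done
```

The loop has no timeout, so interrupt it and go to Branch B if the rollout stalls.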

### Rotate Splunk HEC Credentials

**Do this:** when a Splunk admin issues a new token or after a security event.

**Important:** The Terraform seed used keys `endpoint` and `token`, but the container reads `SPLUNK_HEC_URL` and `SPLUNK_HEC_TOKEN`. Always use the second set of names.
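
Because the wrong key names fail silently until the container restarts, it can help to check the secret string before restarting. The sketch below is a plain text check, not a JSON parser: `check_secret_string` is a hypothetical helper, and the placeholder names it looks for are the ones used in the examples on this page.

```bash
# Hypothetical helper: confirm a secret string contains every expected key
# and no leftover placeholder values from this page's examples.
# Arguments: the secret string, then one or more required key names.
check_secret_string() {
  local secret="$1"; shift
  local key
  for key in "$@"; do
    if ! grep -q "\"$key\"[[:space:]]*:" <<<"$secret"; then
      echo "missing key: $key"
      return 1
    fi
  done
  if grep -Eq '"(NEW_CLIENT_ID|NEW_CLIENT_SECRET|NEW_TOKEN)"' <<<"$secret"; then
    echo "placeholder value still present"
    return 1
  fi
  echo "ok"
}

# Usage (needs AWS credentials):
#   secret=$(aws secretsmanager get-secret-value --secret-id /prod/mcp/splunk \
#     --query 'SecretString' --output text)
#   check_secret_string "$secret" SPLUNK_HEC_URL SPLUNK_HEC_TOKEN
```

The same function works for `/prod/mcp/google-oauth` with `GOOGLE_OAUTH_CLIENT_ID` and `GOOGLE_OAUTH_CLIENT_SECRET` as the required keys.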

1. In Splunk Cloud: Settings > Data Inputs > HTTP Event Collector. Create a new token scoped to the `mcp_audit` index.
2. Update the secret:

   ```bash
   aws secretsmanager put-secret-value \
     --secret-id /prod/mcp/splunk \
     --secret-string '{
       "SPLUNK_HEC_URL":   "https://http-inputs-<stack>.splunkcloud.com/services/collector/event",
       "SPLUNK_HEC_TOKEN": "NEW_TOKEN"
     }'
   ```
3. Restart the container (same command as above).
4. Verify in Splunk:

   ```spl
   index=mcp_audit source=mcp-google earliest=-5m
   | stats count
   ```
5. Disable the old token in Splunk Cloud.

**If something goes wrong:**

* No events in Splunk after restart: Check container logs for `401 Unauthorized` (wrong token) or `403 Forbidden` (HEC is globally disabled in Splunk).
* Events land in `index=_internal` instead of `mcp_audit`: The token is scoped to the wrong index. Create a new one and repeat.

***

## Rolling Back a Bad Change

### Undo a Terraform Change

**When to use:** You ran `terraform apply` and it broke something, and it's faster to go back than to fix forward.

1. List previous state versions:

   ```bash
   aws s3api list-object-versions \
     --bucket 573946584375-mcp-terraform-state \
     --prefix mcp/terraform.tfstate \
     --query 'Versions[*].{VersionId:VersionId,LastModified:LastModified,IsLatest:IsLatest}' \
     --output table
   ```
2. Download the last known-good version:

   ```bash
   aws s3api get-object \
     --bucket 573946584375-mcp-terraform-state \
     --key mcp/terraform.tfstate \
     --version-id VERSION_ID \
     terraform.tfstate.backup
   ```
3. See what changed:

   ```bash
   diff <(python3 -m json.tool terraform.tfstate.backup) \
        <(terraform state pull | python3 -m json.tool) | less
   ```
4. Restore the old state:

   ```bash
   cp terraform.tfstate.backup terraform.tfstate.rollback-$(date +%Y%m%d%H%M%S)

   aws s3 cp terraform.tfstate.backup \
     s3://573946584375-mcp-terraform-state/mcp/terraform.tfstate
   ```
5. Revert the code to match:

   ```bash
   git log --oneline -10
   git checkout <commit-sha> -- .
   ```
6. Confirm and apply:

   ```bash
   terraform init
   terraform plan -out rollback.out
   terraform apply rollback.out
   ```

**Warning:** Never restore the state file without also reverting the code. If the state says one thing and the code says another, the next `terraform plan` will try to "fix" the mismatch, which could make things worse.
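
Step 2 leaves the choice of `VERSION_ID` to you. One way to pick it, sketched below under assumptions, is to take the newest state version written before the bad `terraform apply`. `version_before` is a hypothetical helper, the cutoff timestamp in the usage is a made-up example, and the comparison only works if every timestamp uses the same format and time zone as the CLI output.

```bash
# Hypothetical helper: read "<LastModified> <VersionId>" rows on stdin and print
# the VersionId of the newest row strictly older than the cutoff timestamp.
# Timestamps are compared as strings, so they must share one format and zone.
version_before() {
  local cutoff="$1"
  awk -v cutoff="$cutoff" '
    $1 < cutoff && $1 > best { best = $1; id = $2 }
    END { if (id == "") exit 1; print id }
  '
}

# Usage (needs AWS credentials; the cutoff is an example value):
#   aws s3api list-object-versions \
#     --bucket 573946584375-mcp-terraform-state --prefix mcp/terraform.tfstate \
#     --query 'Versions[*].[LastModified,VersionId]' --output text \
#     | version_before "2024-05-01T13:00:00+00:00"
```

Treat the result as a candidate only, and confirm it with the diff in step 3 before restoring anything.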

### Undo a Container Deployment

**When to use:** A new container image or config change is causing errors and you need to go back to the previous version right now. This is the fastest rollback because it doesn't touch Terraform.

1. Find the current task definition revision and check what the previous one was running:

   ```bash
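   # Ask the service which revision it is running. The latest registered
   # revision is not always the one that is deployed.
   aws ecs describe-services \
     --cluster prod-mcp-cluster \
     --services prod-mcp-google-workspace \
     --query 'services[0].taskDefinition' --output text
   # If the ARN ends in :42, the previous revision is 41.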

   aws ecs describe-task-definition \
     --task-definition prod-mcp-google-workspace:41 \
     --query 'taskDefinition.containerDefinitions[0].image'
   ```
2. Roll back to the previous revision:

   ```bash
   aws ecs update-service \
     --cluster prod-mcp-cluster \
     --service prod-mcp-google-workspace \
     --task-definition prod-mcp-google-workspace:41
   ```
3. Verify:

   ```bash
   curl -sI https://google.mcp.themailworks.com/health
   ```
4. Once stable, open a pull request to revert the bad commit in `themailworks/mcp-google`. Don't leave the pinned revision as a permanent fix; Terraform will redeploy the bad image on the next apply if the code isn't also reverted.

If the previous revision also fails, step back one more. Inspect differences between revisions:

```bash
aws ecs describe-task-definition --task-definition prod-mcp-google-workspace:41 \
  --query 'taskDefinition.containerDefinitions[0].{image:image,environment:environment,secrets:secrets}'
```
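
Working out the previous revision by hand is easy to get wrong under pressure. A small sketch that derives it from a task definition ARN or a `family:revision` string follows. `previous_revision` is a hypothetical helper, and it assumes the revision one below the current one still exists and has not been deregistered.

```bash
# Hypothetical helper: print "<family>:<revision - 1>" for a task definition
# given as an ARN or as "family:revision".
previous_revision() {
  local td="${1##*/}"        # drop everything up to the last slash of an ARN
  local family="${td%:*}"
  local rev="${td##*:}"
  if ! [[ "$rev" =~ ^[0-9]+$ ]] || [ "$rev" -le 1 ]; then
    echo "no previous revision for: $1" >&2
    return 1
  fi
  echo "$family:$((rev - 1))"
}

# Usage (needs AWS credentials):
#   current=$(aws ecs describe-services --cluster prod-mcp-cluster \
#     --services prod-mcp-google-workspace \
#     --query 'services[0].taskDefinition' --output text)
#   previous_revision "$current"
```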

### Rollback Redis State

Redis stores login sessions. Sometimes you need to clear specific sessions or the entire cache.

**Clear one user's session (preferred):**

```bash
TASK=$(aws ecs list-tasks --cluster prod-mcp-cluster \
  --service-name prod-mcp-google-workspace \
  --query 'taskArns[0]' --output text)

aws ecs execute-command \
  --cluster prod-mcp-cluster --task "$TASK" \
  --container google_workspace --interactive --command /bin/sh

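# Inside the container shell, open a Redis prompt:
redis-cli -h "$REDIS_HOST" -p 6379 --tls

# Inside the redis-cli prompt, see what keys exist.
# KEYS blocks Redis while it walks the keyspace; on a large cache use SCAN instead.
KEYS oauth:*
KEYS cred:*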

# Delete one user's session
DEL oauth:user@example.com
```

**Clear all sessions (last resort, everyone has to log in again):**

```bash
# Inside the redis-cli prompt (opened as shown above):
FLUSHDB
```

No container restart is needed after flushing. New sessions are created as users log in again.

**If Redis itself is down:**

1. Confirm it's truly dead, not just in a maintenance state (see Branch C above).
2. As a temporary workaround, set `GOOGLE_MCP_CREDENTIAL_BACKEND=memory` in the container environment. Users have to log in every time the container restarts.
3. To restore from a snapshot:

   ```bash
   aws elasticache describe-snapshots \
     --replication-group-id prod-mcp-redis \
     --query 'Snapshots[*].{name:SnapshotName,status:SnapshotStatus,created:NodeSnapshots[0].SnapshotCreateTime}'
   ```

   Restoring creates a new cluster. Update `REDIS_HOST` and `WORKSPACE_MCP_OAUTH_PROXY_VALKEY_HOST` in Terraform to point to the new endpoint, then apply.

***

## Common Commands

### Restart all services

```bash
for svc in $(aws ecs list-services --cluster prod-mcp-cluster \
  --query 'serviceArns' --output text); do
  aws ecs update-service --cluster prod-mcp-cluster \
    --service "$svc" --force-new-deployment >/dev/null
done
```

### Follow logs in real time

```bash
aws logs tail /ecs/prod/mcp/google_workspace --follow
```

### See what URLs Terraform manages

```bash
terraform output service_urls
terraform output mcp_base_url
```

### Check TLS

```bash
curl -vI https://google.mcp.themailworks.com/health
```

### See the current deployment status

```bash
aws ecs describe-services --cluster prod-mcp-cluster \
  --services prod-mcp-google-workspace \
  --query 'services[0].{desired:desiredCount,running:runningCount,pending:pendingCount,deployments:deployments[*].{status:status,taskDef:taskDefinition,rollout:rolloutState}}'
```

