# How It's Built

This page covers every piece of AWS infrastructure that makes the MCP server work. If you need to understand what a component does, how it's configured, or how to change it, this is where to look.

***

## The Big Picture

The server is made up of six main components, all defined in Terraform and deployed to AWS account `573946584375` in `us-east-1`.

```
Client --> Route 53 (DNS) --> ALB (load balancer, port 443)
                                  |
                                  v
                          ECS Fargate (container)
                            |         |         |
                            v         v         v
                         Redis    Secrets    Splunk
                       (sessions) (passwords) (audit log)
```

Everything below is created automatically by Terraform using a single map called `var.mcp_services`. Each entry in that map produces an ALB rule, target group, ECS service, log group, and DNS record. Want to add a new MCP server for a different service? Add a block to the map, run `terraform apply`, and the entire stack spins up.
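To make that fan-out concrete, here is a rough sketch of the pattern. The resource and VPC names are illustrative assumptions, not the repo's exact code:

```hcl
# Illustrative sketch only: resource and VPC names are assumptions, not the repo's exact code.
resource "aws_lb_target_group" "mcp" {
  for_each = var.mcp_services               # one target group per map entry

  name     = "prod-mcp-${replace(each.key, "_", "-")}-tg"
  port     = each.value.container_port
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  health_check {
    path = each.value.health_check_path
  }
}
```

The same `for_each = var.mcp_services` pattern repeats for the listener rule, ECS service, log group, and DNS record, which is why one map entry is all a new server needs.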

***

## Load Balancer (ALB)

**Defined in:** `alb.tf`

The Application Load Balancer is the only thing with a public IP address. Everything behind it is private.

**What it does:**

* Accepts HTTPS traffic at `*.mcp.themailworks.com`.
* Redirects any HTTP request to HTTPS automatically.
* Looks at the hostname in each request (e.g., `google.mcp.themailworks.com`) and sends it to the right container.
* Routes OAuth login paths (`/.well-known/*`, `/oauth2/*`, `/auth/*`) to the Google Workspace service regardless of hostname.
* Returns a 404 for anything that doesn't match a known service.

**How routing works under the hood:**

1. The ALB is created across both public subnets.
2. For each service in `var.mcp_services`, Terraform creates a target group on the service's port with its health check path.
3. A listener on port 443 uses TLS policy `ELBSecurityPolicy-TLS13-1-2-2021-06`.
4. One host-header rule per service routes traffic to the right target group. The OAuth rule sits at priority 10 so it always wins.
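A hedged sketch of what one generated host-header rule might look like. The resource name and priority formula are assumptions, as is the use of `path_prefix` as the hostname prefix; `aws_lb_listener.https` is the listener described above:

```hcl
# Sketch only: resource name, priority formula, and hostname pattern are assumptions.
resource "aws_lb_listener_rule" "host_routing" {
  for_each = var.mcp_services

  listener_arn = aws_lb_listener.https.arn
  priority     = 100 + index(keys(var.mcp_services), each.key)   # the OAuth rule keeps priority 10

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.mcp[each.key].arn
  }

  condition {
    host_header {
      values = ["${each.value.path_prefix}.mcp.themailworks.com"]
    }
  }
}
```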

**To change the routing:**

* New service: add an entry to `var.mcp_services`. Everything else is automatic.
* New OAuth path: add it to the `path_pattern` values in `aws_lb_listener_rule.oauth_wellknown`.
* TLS policy upgrade: change `ssl_policy` on `aws_lb_listener.https`.

**Alarms (CloudWatch):**

These notify the SNS topic at `arn:aws:sns:us-east-1:573946584375:MCPAlerts`:

* Unhealthy host count > 1 for one minute.
* Healthy host count < 1 for one minute.
* 5xx error count > 1 for one minute.
* Latency alarms on P95, P99, and target response time. **Note:** The `1.0` second thresholds are placeholders. Retune these once you have real traffic data or they'll either fire constantly or never fire at all.
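For orientation, a sketch of roughly what one of the latency alarms looks like. The resource and load balancer names are assumptions, and the `1.0` threshold is the placeholder mentioned above:

```hcl
# Sketch only: resource and load balancer names are assumptions.
resource "aws_cloudwatch_metric_alarm" "alb_p95_latency" {
  alarm_name          = "prod-mcp-alb-p95-latency"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "TargetResponseTime"
  extended_statistic  = "p95"
  period              = 60
  evaluation_periods  = 1
  threshold           = 1.0                    # placeholder; retune against real traffic
  comparison_operator = "GreaterThanThreshold"

  dimensions = {
    LoadBalancer = aws_lb.main.arn_suffix
  }

  alarm_actions = ["arn:aws:sns:us-east-1:573946584375:MCPAlerts"]
}
```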

**Check target health:**

```bash
aws elbv2 describe-target-health \
  --target-group-arn $(aws elbv2 describe-target-groups \
    --names prod-mcp-google-workspace-tg \
    --query 'TargetGroups[0].TargetGroupArn' --output text)
```

***

## Containers (ECS Fargate)

**Defined in:** `ecs.tf`

**Cluster name:** `prod-mcp-cluster`

The actual MCP server runs as a Python application inside a Docker container on AWS Fargate. Fargate is serverless, meaning there are no servers to manage or patch.

**What Terraform creates for each service:**

1. A CloudWatch log group at `/ecs/prod/mcp/<service>` with 365-day retention.
2. A task definition with CPU, memory, container image, and port settings.
3. An ECS service in the private subnets, connected to the ALB target group.
4. A health check that pings the configured path every 30 seconds.
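A minimal sketch of the log group from step 1; the resource name is assumed, but the path and retention match the description above:

```hcl
# Sketch: resource name is an assumption; path and retention match the description above.
resource "aws_cloudwatch_log_group" "mcp" {
  for_each          = var.mcp_services
  name              = "/ecs/prod/mcp/${each.key}"
  retention_in_days = 365
}
```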

**Networking:**

* Containers run in private subnets with no public IP.
* Outbound internet traffic goes through NAT gateways (defined in `sg.tf`).
* The only inbound traffic allowed is from the ALB.

**IAM Roles (who can do what):**

* **Execution role** (`ecs_task_execution`): Used by AWS to pull the container image from ECR and read secrets from Secrets Manager. Scoped to `/prod/mcp/*` secrets only.
* **Task role** (`ecs_task`): Used by the running container. Currently only has permissions for shell access (SSM). If a future MCP server needs to talk to S3 or SQS, add permissions here in `iam.tf`.
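For example, granting a future service S3 access might look roughly like this in `iam.tf`. The policy name, bucket, and actions are hypothetical:

```hcl
# Hypothetical example: attach extra permissions to the task role for a future service.
resource "aws_iam_role_policy" "mcp_task_s3" {
  name = "mcp-task-s3-access"
  role = aws_iam_role.ecs_task.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:GetObject", "s3:PutObject"]
      Resource = "arn:aws:s3:::example-bucket/*"
    }]
  })
}
```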

**How deployments work:**

* `force_new_deployment = true` means every Terraform apply that changes the task definition triggers a new rollout.
* `enable_execute_command = true` lets you open a shell inside a running container for debugging.
* The target group drains connections for 30 seconds before swapping, so users don't get dropped mid-request.
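Sketched in Terraform, those service-level settings look roughly like this. The resource and cluster names are assumptions and most arguments are omitted:

```hcl
# Sketch only: resource and cluster names are assumptions; most arguments are omitted.
resource "aws_ecs_service" "mcp" {
  for_each = var.mcp_services

  name                   = "prod-mcp-${replace(each.key, "_", "-")}"
  cluster                = aws_ecs_cluster.main.id
  desired_count          = each.value.desired_count
  force_new_deployment   = true   # any change to the task definition rolls out on apply
  enable_execute_command = true   # enables `aws ecs execute-command` shells

  # task_definition, network_configuration, and load_balancer blocks omitted.
  # The 30-second drain is the target group's deregistration delay, not a service setting.
}
```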

**Deploy a new container image:**

1. Build and push the image to ECR. The default URI is `573946584375.dkr.ecr.us-east-1.amazonaws.com/mcp/workspace:latest`.
2. If the image tag changed, update the `image` field in `var.mcp_services`. If you're using `:latest`, just force a redeploy:

   ```bash
   aws ecs update-service \
     --cluster prod-mcp-cluster \
     --service prod-mcp-google-workspace \
     --force-new-deployment
   ```
3. Watch the deployment:

   ```bash
   aws ecs describe-services \
     --cluster prod-mcp-cluster \
     --services prod-mcp-google-workspace \
     --query 'services[0].deployments'
   ```

**Open a shell inside a running container:**

```bash
TASK=$(aws ecs list-tasks --cluster prod-mcp-cluster \
  --service-name prod-mcp-google-workspace \
  --query 'taskArns[0]' --output text)

aws ecs execute-command \
  --cluster prod-mcp-cluster \
  --task "$TASK" \
  --container google_workspace \
  --interactive --command /bin/sh
```

This requires the SSM Session Manager plugin to be installed locally.

**View logs:**

```bash
aws logs tail /ecs/prod/mcp/google_workspace --follow
```

***

## Session Storage (Redis)

**Defined in:** `elasticache.tf`

When someone logs in through Google, the server stores their session in Redis so they don't have to log in again every time. Sessions last 30 days.

**What's created:**

* A Redis cluster (ElastiCache) with one `cache.t4g.micro` node running engine 7.1.
* Encryption at rest and in transit.
* A security group that only allows connections from the ECS containers.
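A sketch of the cluster definition, assuming the settings described above. The resource name and security group reference are assumptions:

```hcl
# Sketch: resource name and security group reference are assumptions; settings match the description.
resource "aws_elasticache_replication_group" "redis" {
  replication_group_id       = "prod-mcp-redis"
  description                = "Session storage for MCP services"
  engine                     = "redis"
  engine_version             = "7.1"
  node_type                  = "cache.t4g.micro"
  num_cache_clusters         = 1                 # single node today
  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
  security_group_ids         = [aws_security_group.redis.id]
  final_snapshot_identifier  = "prod-mcp-redis-final"
}
```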

**How the container connects:**

These environment variables are injected into the container automatically:

* `WORKSPACE_MCP_OAUTH_PROXY_STORAGE_BACKEND=valkey`
* `WORKSPACE_MCP_OAUTH_PROXY_VALKEY_HOST=<redis endpoint>`
* `WORKSPACE_MCP_OAUTH_PROXY_VALKEY_PORT=6379`
* `WORKSPACE_MCP_OAUTH_PROXY_VALKEY_USE_TLS=true`
* `GOOGLE_MCP_CREDENTIAL_BACKEND=redis`
* `REDIS_HOST=<redis endpoint>`, `REDIS_PORT=6379`, `REDIS_USE_TLS=true`
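The `<redis endpoint>` values come from the ElastiCache resource's endpoint attribute. A sketch of the container-definition fragment in `ecs.tf` that likely does the injection; the exact structure is an assumption:

```hcl
# Sketch of the container-definition fragment in ecs.tf (structure assumed).
environment = [
  { name = "REDIS_HOST",    value = aws_elasticache_replication_group.redis.primary_endpoint_address },
  { name = "REDIS_PORT",    value = "6379" },
  { name = "REDIS_USE_TLS", value = "true" },
  # ...plus the WORKSPACE_MCP_OAUTH_PROXY_VALKEY_* equivalents listed above
]
```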

**To scale Redis for high availability:**

Set `num_cache_clusters = 2` in `elasticache.tf` and apply. This adds a read replica and enables automatic failover. With only one node (current state), if Redis goes down, all users have to log in again when it comes back.
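Concretely, the change in `elasticache.tf` is roughly the following; depending on how the resource is written, automatic failover may also need to be enabled explicitly:

```hcl
# In elasticache.tf; automatic_failover_enabled may already be set conditionally.
num_cache_clusters         = 2
automatic_failover_enabled = true
```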

**To upgrade capacity:**

Change `node_type` (e.g., `cache.t4g.small`) and apply. Expect a brief interruption while the node is replaced.

**To connect for debugging:**

From inside a container shell:

```bash
redis-cli -h "$REDIS_HOST" -p 6379 --tls
```

**Backups:**

`final_snapshot_identifier = "prod-mcp-redis-final"` means a snapshot is taken if the cluster is ever destroyed. There are no scheduled backups. Add `snapshot_retention_limit` if you want regular automatic snapshots.
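For example, with illustrative values rather than current settings:

```hcl
# In elasticache.tf; values are examples, not current settings.
snapshot_retention_limit = 7                 # keep daily snapshots for a week
snapshot_window          = "03:00-05:00"     # UTC
```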

***

## Secrets (AWS Secrets Manager)

**Defined in:** `secrets.tf`

Passwords and API keys are stored in AWS Secrets Manager and injected into the container at startup. They are never stored in code or configuration files.

**Current secrets:**

| Name         | Path                     | Keys                                                   | Used By                   |
| ------------ | ------------------------ | ------------------------------------------------------ | ------------------------- |
| Google OAuth | `/prod/mcp/google-oauth` | `GOOGLE_OAUTH_CLIENT_ID`, `GOOGLE_OAUTH_CLIENT_SECRET` | Container (Google login)  |
| Splunk HEC   | `/prod/mcp/splunk`       | `SPLUNK_HEC_URL`, `SPLUNK_HEC_TOKEN`                   | Container (audit logging) |

Both secrets have a 7-day recovery window, so accidental deletion is reversible.

**Important:** Terraform only sets these once (at first apply). After that, the values are managed through the AWS CLI or Console, not Terraform. This is by design:

```hcl
lifecycle {
  ignore_changes = [secret_string]
}
```

For step-by-step rotation instructions, see the How to Troubleshoot page.

**To add a secret for a new MCP service:**

1. Add a new `aws_secretsmanager_secret` resource in `secrets.tf` under the `/prod/mcp/` prefix.
2. Add a matching `aws_secretsmanager_secret_version` with `ignore_changes = [secret_string]`.
3. In `ecs.tf`, add a conditional branch in the `secrets` block to map the JSON keys to environment variables.
4. Apply, then use `aws secretsmanager put-secret-value` to store the real credentials.

No IAM changes are needed as long as new secrets use the `/prod/mcp/` prefix.
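A hedged sketch of steps 1–3 for a hypothetical `slack` secret; the names, JSON key, and container-definition mapping are assumptions:

```hcl
# Hypothetical "slack" secret: names and JSON keys are assumptions.
resource "aws_secretsmanager_secret" "slack" {
  name                    = "/prod/mcp/slack"
  recovery_window_in_days = 7
}

resource "aws_secretsmanager_secret_version" "slack" {
  secret_id     = aws_secretsmanager_secret.slack.id
  secret_string = jsonencode({ SLACK_BOT_TOKEN = "placeholder" })   # real value set via the CLI in step 4

  lifecycle {
    ignore_changes = [secret_string]
  }
}

# Then, in the container definition in ecs.tf, map the JSON key to an environment variable:
# secrets = [
#   { name = "SLACK_BOT_TOKEN", valueFrom = "${aws_secretsmanager_secret.slack.arn}:SLACK_BOT_TOKEN::" }
# ]
```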

***

## Audit Logging (Splunk)

Every time someone uses a tool through Claude or logs in, the server sends an event to Splunk Cloud. These logs record who did what and when, but never the actual content of documents or emails.

**How it's wired (in the container's environment):**

* `SPLUNK_INDEX = "mcp_audit"`
* `SPLUNK_SOURCE = "mcp-google"`
* `AUDIT_LOG_ENABLED = "true"`
* `SPLUNK_HEC_URL` and `SPLUNK_HEC_TOKEN` come from Secrets Manager.
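A sketch of the corresponding fragment in `ecs.tf`; the exact structure is an assumption:

```hcl
# Sketch of the container-definition fragment in ecs.tf (structure assumed).
environment = [
  { name = "SPLUNK_INDEX",      value = "mcp_audit" },
  { name = "SPLUNK_SOURCE",     value = "mcp-google" },
  { name = "AUDIT_LOG_ENABLED", value = "true" },
]

# SPLUNK_HEC_URL and SPLUNK_HEC_TOKEN are mapped from /prod/mcp/splunk in the secrets block.
```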

**To turn off audit logging temporarily:**

Set `AUDIT_LOG_ENABLED = "false"` in the environment block in `ecs.tf` and apply. CloudWatch logging is unaffected.

**To verify events are arriving:**

In Splunk:

```spl
index=mcp_audit source=mcp-google earliest=-15m
| stats count by sourcetype
```

If the count is zero, check container logs for HEC errors:

```bash
aws logs tail /ecs/prod/mcp/google_workspace --since 15m --filter-pattern HEC
```

A 401 means the token is wrong. A 403 means HEC is disabled on the Splunk side.

***

## DNS and TLS Certificates

**Defined in:** `dns_acm.tf` and `main.tf`

* Route 53 hosts the `themailworks.com` zone. After first apply, verify the registrar nameservers match with `terraform output route53_nameservers`.
* A single wildcard ACM certificate covers `mcp.themailworks.com` and `*.mcp.themailworks.com`. Adding new MCP servers does not require a new certificate.
* Each service gets an alias A record in Route 53 pointing to the ALB.
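A sketch of the per-service alias record; the resource names are assumptions:

```hcl
# Sketch: resource names are assumptions; one alias record per service, pointing at the ALB.
resource "aws_route53_record" "mcp" {
  for_each = var.mcp_services

  zone_id = aws_route53_zone.main.zone_id
  name    = "${each.value.path_prefix}.mcp.themailworks.com"
  type    = "A"

  alias {
    name                   = aws_lb.main.dns_name
    zone_id                = aws_lb.main.zone_id
    evaluate_target_health = true
  }
}
```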

***

## Adding a New MCP Server (e.g., Slack)

The `var.mcp_services` map is the single place to add a new server. One new entry produces the full stack: ALB rule, target group, ECS service, log group, and DNS record.

1. Add a block in `variables.tf`:

   ```hcl
   slack = {
     path_prefix       = "slack"
     image             = "573946584375.dkr.ecr.us-east-1.amazonaws.com/mcp/slack:latest"
     container_port    = 8001
     cpu               = 256
     memory            = 512
     desired_count     = 1
     health_check_path = "/health"
     environment_vars  = {}
   }
   ```
2. If it needs secrets, follow the Secrets Manager steps above.
3. If it needs Redis, extend the conditional environment block in `ecs.tf` the same way the `google_workspace` branch does.
4. `terraform apply`. The new server will be live at `https://slack.mcp.themailworks.com`.

**To take a service offline without deleting it:**

Set `desired_count = 0` and apply. The ALB will return 503. Remove the entry entirely to tear down all resources.

***

## Observability Summary

| What                        | Where                                                                           |
| --------------------------- | ------------------------------------------------------------------------------- |
| Application logs            | AWS CloudWatch (`/ecs/prod/mcp/google_workspace`, 365-day retention)            |
| Audit events (who did what) | Splunk Cloud (`mcp_audit` index)                                                |
| Deployment history          | GitHub Actions                                                                  |
| Alarms                      | CloudWatch, delivered via SNS to `arn:aws:sns:us-east-1:573946584375:MCPAlerts` |

***

## Known Issues

* **Latency alarm thresholds are placeholders.** The `1.0` second values in `alb.tf` need to be tuned to real baselines or they'll generate noise.
* **ALB deletion protection is on.** You must disable `enable_deletion_protection` explicitly before Terraform can destroy the ALB.
* **Splunk secret key mismatch.** The Terraform seed JSON uses `endpoint`/`token`, but the container reads `SPLUNK_HEC_URL`/`SPLUNK_HEC_TOKEN`. This works today because the values were manually corrected in Secrets Manager, but plan to align the seed on next rotation.
* **Single Redis node.** No automatic failover until you set `num_cache_clusters = 2`. If Redis goes down, everyone has to log in again.
* **ECS task role is minimal.** If a new MCP server needs AWS APIs beyond SSM, you'll need to add policies in `iam.tf`.
* **Only 3 of 12 Google APIs are enabled in the GCP project.** Enabling new tool families requires enabling the corresponding API first.

