diff --git a/DEPLOY.md b/DEPLOY.md
new file mode 100644
index 0000000..710af6b
--- /dev/null
+++ b/DEPLOY.md
@@ -0,0 +1,123 @@
+# Pilot Toolkit Web — Deploy Workflow
+
+This document describes the workflow for deploying changes from the
+development machine to the SOL cluster. Keep it close.
+
+## Quick Reference: source change to deployed
+
+1. **Edit source** on dev machine (`~/pilot-toolkit-web`)
+2. **Commit and push** to git
+   ```bash
+   git add -A
+   git commit -m "Description of change"
+   git push
+   ```
+3. **SSH to sol0** and pull
+   ```bash
+   ssh sol0
+   cd ~/pilot-toolkit-web
+   git pull
+   ```
+4. **Rebuild image**
+   ```bash
+   docker build -t git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest .
+   ```
+5. **Push to registry**
+   ```bash
+   docker push git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest
+   ```
+   If it says `denied:`, refresh login: `docker login git.bennu.duckdns.org`
+6. **Trigger rolling update**
+   ```bash
+   sudo docker service update --force --with-registry-auth \
+     --image git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest \
+     ptk_ptk
+   ```
+7. **Verify**
+   ```bash
+   sudo docker service ps ptk_ptk
+   ```
+   Tasks should reach `Running` and stay there past ~2 minutes.
+
+## Mental model
+
+Three states must stay in sync:
+
+1. **Source** — your code on disk and in git
+2. **Image** — built artifact in local Docker (created by `docker build`)
+3. **Registry** — uploaded image in Forgejo (uploaded by `docker push`)
+
+Editing source changes #1 only. You must `build` to update #2 and `push`
+to update #3. Swarm deploys from #3, not from git.
+
+A real rebuild produces some layers that say `Pushed`. If `docker push`
+shows `Layer already exists` for *every* layer, no new build happened.
+
+## When things go wrong
+
+| Symptom | Likely cause | Fix |
+| ------------------------------------------------- | ------------------------------------ | ---------------------------------------------------- |
+| `docker push` says `denied:` | Stale login session | `docker login git.bennu.duckdns.org` |
+| `docker push` says HTTP 500 | Forgejo registry hiccup | Retry 2-3 times; already-pushed layers are skipped |
+| Update says "could not be accessed on a registry" | Worker nodes can't auth | Use `--with-registry-auth` flag |
+| Tasks die at exactly 95 seconds | Healthcheck failing | Test healthcheck manually inside running container |
+| Tasks die immediately | Container itself crashes | `docker service logs ptk_ptk` |
+| Service runs but URL gives 404 | Traefik routing issue | Check Traefik labels in stack.yml match Host() |
+| Service runs but URL gives 503 | Traefik can't reach container | Check container is on cluster-net |
+| `docker service ps` shows old timestamps | Looking at task history, not current | Add `--filter desired-state=running` |
+
+## Verification commands
+
+```bash
+# What image is the service currently configured to run?
+sudo docker service inspect ptk_ptk --pretty | grep -A1 Image
+
+# What's the healthcheck inside the locally-built image?
+docker inspect git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest \
+  --format '{{json .Config.Healthcheck}}'
+
+# Show only currently-running tasks (not historical)
+sudo docker service ps ptk_ptk --filter desired-state=running
+
+# Tail logs from running tasks
+sudo docker service logs -f --tail 50 ptk_ptk
+```
+
+## Stack management
+
+```bash
+# Show running services
+sudo docker service ls
+
+# Tear down stack entirely
+sudo docker stack rm ptk
+
+# Redeploy from scratch
+sudo docker stack deploy --with-registry-auth -c /root/ptk-stack.yml ptk
+
+# Force restart without changes
+sudo docker service update --force ptk_ptk
+```
+
+## Healthcheck notes
+
+The healthcheck must use `127.0.0.1`, not `localhost`. BusyBox `wget`
+inside Alpine resolves `localhost` to its IPv6 address (`::1`) first,
+and nginx is only listening on IPv4. Using the explicit IPv4 address
+sidesteps the resolution issue entirely.
+
+The healthcheck timing is `interval=30s timeout=3s start-period=5s
+retries=3`, which means a failing healthcheck takes about 95 seconds
+to mark the container unhealthy. Tasks dying at the 95-second mark is
+the signature of a broken healthcheck.
+
+## Cluster context
+
+| Component | Value |
+| --------- | ----- |
+| Cluster | SOL (4× Raspberry Pi 4, ARM64) |
+| Manager | sol0 |
+| Registry | `git.bennu.duckdns.org/jshackney/pilot-toolkit-web` |
+| Service | `ptk_ptk` (in stack `ptk`) |
+| URL | `https://ptk.bennu.duckdns.org` |
+| Stack file location on sol0 | `/root/ptk-stack.yml` |
diff --git a/doc/README-web.md b/doc/README.md
similarity index 100%
rename from doc/README-web.md
rename to doc/README.md
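
The "about 95 seconds" figure in the Healthcheck notes of DEPLOY.md can be reproduced with a back-of-the-envelope calculation. The sketch below assumes that failures only start counting once the start period is over and that probes run one interval apart. The function name `estimate_unhealthy_after` is made up for illustration, and the formula is a simplification, not Docker's exact scheduling.

```bash
#!/bin/sh
# Rough estimate (in seconds) of how long an always-failing healthcheck
# takes to mark a container unhealthy:
#   start period + retries * interval
# Assumption: this ignores the per-probe timeout and scheduling jitter.
estimate_unhealthy_after() {
  start_period=$1   # seconds, e.g. start-period=5s
  interval=$2       # seconds, e.g. interval=30s
  retries=$3        # consecutive failures required, e.g. retries=3
  echo $(( start_period + retries * interval ))
}

# Values from DEPLOY.md: start-period=5s, interval=30s, retries=3
estimate_unhealthy_after 5 30 3
```

The per-probe `timeout=3s` is left out of the estimate, so a probe that hangs instead of failing fast may push the real figure a few seconds later.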