add further documentation for deployment and rename readme to conform to standards

handfly 2026-05-03 18:31:07 -04:00
parent 347831b441
commit 5032cecc04
2 changed files with 123 additions and 0 deletions

DEPLOY.md Normal file

@ -0,0 +1,123 @@
# Pilot Toolkit Web — Deploy Workflow
This document describes the workflow for deploying changes from the
development machine to the SOL cluster. Keep it close.
## Quick Reference: source change to deployed
1. **Edit source** on dev machine (`~/pilot-toolkit-web`)
2. **Commit and push** to git
```bash
git add -A
git commit -m "Description of change"
git push
```
3. **SSH to sol0** and pull
```bash
ssh sol0
cd ~/pilot-toolkit-web
git pull
```
4. **Rebuild image**
```bash
docker build -t git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest .
```
5. **Push to registry**
```bash
docker push git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest
```
If the push fails with `denied:`, log in again with `docker login git.bennu.duckdns.org` and repeat the push.
6. **Trigger rolling update**
```bash
sudo docker service update --force --with-registry-auth \
--image git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest \
ptk_ptk
```
7. **Verify**
```bash
sudo docker service ps ptk_ptk
```
Tasks should reach `Running` and stay there for more than ~2 minutes, long enough to outlast the roughly 95 seconds a failing healthcheck needs to kill a task.
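
Once the pull in step 3 is done, steps 4 through 7 can be chained into one function. The following is a minimal sketch, not a script that exists in the repository: the `run` and `deploy` function names and the `DRY_RUN` variable are illustrative assumptions, while the image name, service name, and flags are the ones used in the steps above.

```bash
# Sketch only: run on sol0 from ~/pilot-toolkit-web after `git pull`.
IMAGE="git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest"
SERVICE="ptk_ptk"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Build, push, update, verify. The && chain stops at the first failure.
deploy() {
  run docker build -t "$IMAGE" . &&
  run docker push "$IMAGE" &&
  run sudo docker service update --force --with-registry-auth \
      --image "$IMAGE" "$SERVICE" &&
  run sudo docker service ps "$SERVICE"
}
```

Calling `DRY_RUN=1 deploy` prints the four commands without executing them, which is a cheap way to check the image and service names before a real `deploy`.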
## Mental model
Three states must stay in sync:
1. **Source** — your code on disk and in git
2. **Image** — built artifact in local Docker (created by `docker build`)
3. **Registry** — uploaded image in Forgejo (uploaded by `docker push`)
Editing source changes #1 only. You must `build` to update #2 and `push`
to update #3. Swarm deploys from #3, not from git.
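When a rebuild actually changed the image, `docker push` reports `Pushed`
for at least one layer. If it shows `Layer already exists` for *every*
layer, the image did not change, which usually means the build was skipped
or ran against stale source.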
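
The `Pushed` versus `Layer already exists` check can be automated by counting the status lines that `docker push` prints. This is a minimal sketch: `summarize_push` is a hypothetical helper name, and it assumes the plain line-per-layer output that `docker push` writes when its output is piped.

```bash
# Read `docker push` output on stdin and count layer statuses.
summarize_push() {
  awk '
    /: Pushed[ \t\r]*$/               { pushed++ }
    /: Layer already exists[ \t\r]*$/ { existing++ }
    END { printf "pushed=%d existing=%d\n", pushed, existing }
  '
}
```

Used as `docker push <image> | summarize_push`, a result of `pushed=0` suggests that nothing new reached the registry.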
## When things go wrong
| Symptom | Likely cause | Fix |
| ------------------------------------------------- | ------------------------------------ | ---------------------------------------------------- |
| `docker push` says `denied:` | Stale login session | `docker login git.bennu.duckdns.org` |
| `docker push` says HTTP 500                       | Forgejo registry hiccup              | Retry 2-3 times; layers that already uploaded are skipped, so each retry makes progress |
| Update says "could not be accessed on a registry" | Worker nodes can't auth | Use `--with-registry-auth` flag |
| Tasks die at about 95 seconds                     | Healthcheck failing                  | Run the healthcheck command manually inside the running container |
| Tasks die immediately                             | Container itself crashes             | `sudo docker service logs ptk_ptk`                   |
| Service runs but URL gives 404                    | Traefik routing issue                | Check that the `Host()` rule in the Traefik labels in `ptk-stack.yml` matches the URL |
| Service runs but URL gives 503                    | Traefik can't reach container        | Check that the container is attached to `cluster-net` |
| `docker service ps` shows old timestamps | Looking at task history, not current | Add `--filter desired-state=running` |
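
The "retry 2-3 times" advice for HTTP 500 errors can be wrapped in a small helper. This is a sketch under assumptions: `retry` and `RETRY_DELAY` are illustrative names, and the 5-second default pause is arbitrary rather than a value from this setup.

```bash
# retry N CMD...: run CMD until it succeeds, at most N times,
# sleeping RETRY_DELAY seconds (default 5) between attempts.
retry() {
  local max="$1"; shift
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${RETRY_DELAY:-5}s" >&2
    sleep "${RETRY_DELAY:-5}"
    attempt=$((attempt + 1))
  done
}
```

For example, `retry 3 docker push git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest` would attempt the push up to three times. It retries on any failure, so a `denied:` error will also be retried and still needs a fresh `docker login`.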
## Verification commands
```bash
# What image is the service currently configured to run?
sudo docker service inspect ptk_ptk --pretty | grep -A1 Image
# What's the healthcheck inside the locally-built image?
docker inspect git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest \
--format '{{json .Config.Healthcheck}}'
# Show only currently-running tasks (not historical)
sudo docker service ps ptk_ptk --filter desired-state=running
# Tail logs from running tasks
sudo docker service logs -f --tail 50 ptk_ptk
```
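
To confirm that the service is running the image that was just pushed, the digests can be compared. Swarm typically resolves the tag to a digest when the service is updated, so the service's image reference usually ends in `@sha256:...`. The helper below is a sketch: `image_digest` is a hypothetical name, and the comparison in the comments assumes both references carry a digest, which may not hold on every Docker setup.

```bash
# Print the digest part of an image reference, or fail if the
# reference is not pinned to a digest.
image_digest() {
  case "$1" in
    *@sha256:*) printf '%s\n' "${1##*@}" ;;
    *)          return 1 ;;
  esac
}

# Usage on sol0 (shown as comments, not executed here):
#   svc=$(sudo docker service inspect ptk_ptk \
#           --format '{{.Spec.TaskTemplate.ContainerSpec.Image}}')
#   loc=$(docker image inspect git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest \
#           --format '{{index .RepoDigests 0}}')
#   [ "$(image_digest "$svc")" = "$(image_digest "$loc")" ] && echo "digests match"
```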
## Stack management
```bash
# Show running services
sudo docker service ls
# Tear down stack entirely
sudo docker stack rm ptk
# Redeploy from scratch
sudo docker stack deploy --with-registry-auth -c /root/ptk-stack.yml ptk
# Force restart without changes
sudo docker service update --force ptk_ptk
```
## Healthcheck notes
The healthcheck must use `127.0.0.1`, not `localhost`. BusyBox `wget`
inside Alpine resolves `localhost` to its IPv6 address (`::1`) first,
and nginx is only listening on IPv4. Using the explicit IPv4 address
sidesteps the resolution issue entirely.
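
A quick guard against reintroducing `localhost` is to scan the healthcheck JSON printed by the `docker inspect ... --format '{{json .Config.Healthcheck}}'` command from the verification section. This is a minimal sketch: `healthcheck_avoids_localhost` is a hypothetical helper, and the `wget` command in the sample JSON is an assumed example, not the actual healthcheck from the image.

```bash
# Read healthcheck JSON on stdin; succeed only if it never mentions localhost.
healthcheck_avoids_localhost() {
  ! grep -q 'localhost'
}

# Assumed sample of what `docker inspect` might print for a healthcheck.
sample='{"Test":["CMD-SHELL","wget -q --spider http://127.0.0.1/ || exit 1"],"Retries":3}'

if printf '%s\n' "$sample" | healthcheck_avoids_localhost; then
  echo "healthcheck uses an explicit address"
else
  echo "healthcheck mentions localhost; use 127.0.0.1 instead"
fi
```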
The healthcheck timing is `interval=30s timeout=3s start-period=5s
retries=3`, which means a failing healthcheck takes about 95 seconds
to mark the container unhealthy. Tasks dying at the 95-second mark is
the signature of a broken healthcheck.
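
The "about 95 seconds" figure can be sanity-checked with a little arithmetic. The sketch below assumes the scheduling described in Docker's documentation: the first probe runs one interval after the container starts, and each later probe runs one interval after the previous one finishes. With a 5-second start period, every probe falls outside it and counts toward the retry limit. Exact timing may differ between Docker versions.

```bash
interval=30 timeout=3 retries=3

# Fast failures (for example, connection refused): each probe fails at once.
fastest=$(( retries * interval ))
# Slow failures: each probe runs into the timeout before the next interval starts.
slowest=$(( retries * (interval + timeout) ))

echo "unhealthy after roughly ${fastest}-${slowest}s"
```

Under these assumptions the container is marked unhealthy somewhere between 90 and 99 seconds after start, which brackets the 95-second mark quoted above.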
## Cluster context
| Component | Value |
| --------- | ----- |
| Cluster | SOL (4× Raspberry Pi 4, ARM64) |
| Manager | sol0 |
| Registry | `git.bennu.duckdns.org/jshackney/pilot-toolkit-web` |
| Service | `ptk_ptk` (in stack `ptk`) |
| URL | `https://ptk.bennu.duckdns.org` |
| Stack file location on sol0 | `/root/ptk-stack.yml` |