# Pilot Toolkit Web — Deploy Workflow
This document describes the workflow for deploying changes from the
development machine to the SOL cluster. Keep it close.
## Quick Reference: source change to deployed
1. **Edit source** on dev machine (`~/pilot-toolkit-web`)
2. **Commit and push** to git
```bash
git add -A
git commit -m "Description of change"
git push
```
3. **SSH to sol0** and pull
```bash
ssh sol0
cd ~/pilot-toolkit-web
git pull
```
4. **Rebuild image**
```bash
docker build -t git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest .
```
5. **Push to registry**
```bash
docker push git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest
```
If the push fails with `denied:`, refresh the login with `docker login git.bennu.duckdns.org` and push again.
6. **Trigger rolling update**
```bash
sudo docker service update --force --with-registry-auth \
  --image git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest \
  ptk_ptk
```
7. **Verify**
```bash
sudo docker service ps ptk_ptk
```
Tasks should reach `Running` and stay there for more than ~2 minutes, which is long enough to outlast the roughly 95-second healthcheck failure window.
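
Steps 4 through 7 can be chained so that a failed build or push never reaches the rolling update. The following is a minimal sketch, not a script from this repository: the function name `deploy_ptk` and the `IMAGE` and `SERVICE` variables are illustrative, and it assumes you are already on sol0 in `~/pilot-toolkit-web` after `git pull`.

```bash
# Sketch only: chains steps 4-7. Names below are illustrative assumptions.
IMAGE="git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest"
SERVICE="ptk_ptk"

deploy_ptk() {
    # Stop at the first failing step so a broken build is never rolled out.
    docker build -t "$IMAGE" . || return 1
    docker push "$IMAGE" || return 1
    sudo docker service update --force --with-registry-auth \
        --image "$IMAGE" "$SERVICE" || return 1
    # Show only current tasks, not task history.
    sudo docker service ps "$SERVICE" --filter desired-state=running
}

# Run from ~/pilot-toolkit-web on sol0:
# deploy_ptk
```

Keeping the steps in a function rather than a bare list of commands means each step's exit status is checked before the next one runs.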
## Mental model
Three states must stay in sync:
1. **Source** — your code on disk and in git
2. **Image** — built artifact in local Docker (created by `docker build`)
3. **Registry** — uploaded image in Forgejo (uploaded by `docker push`)

Editing source changes only #1. You must run `docker build` to update #2
and `docker push` to update #3. Swarm deploys from #3, not from git.

A real rebuild uploads at least one new layer, which `docker push` reports
as `Pushed`. If `docker push` shows `Layer already exists` for *every*
layer, the image is identical to the one already in the registry, so no
new build happened.
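
The "did a new build actually get pushed?" check can be automated by counting `Pushed` lines in the push output. This is a rough sketch under the assumption that the output is piped or saved to a file, so that each layer status appears on its own line. `count_pushed_layers` and `push.log` are hypothetical names.

```bash
# Sketch: count the layers a push actually uploaded.
# Reads `docker push` output on stdin; the helper name is hypothetical.
count_pushed_layers() {
    # Uploaded layers are reported as "<id>: Pushed"; unchanged layers as
    # "<id>: Layer already exists". grep exits non-zero on zero matches,
    # so `|| true` keeps the function's status clean.
    grep -c ': Pushed' || true
}

# Example:
#   docker push git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest 2>&1 | tee push.log
#   count_pushed_layers < push.log
```

A result of `0` means the registry already had every layer, which usually points at a skipped or no-op `docker build`.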
## When things go wrong
| Symptom | Likely cause | Fix |
| ------------------------------------------------- | ------------------------------------ | ---------------------------------------------------- |
| `docker push` says `denied:` | Stale login session | `docker login git.bennu.duckdns.org` |
| `docker push` says HTTP 500 | Forgejo registry hiccup | Retry 2-3 times; layers that already uploaded are skipped, so each retry makes progress |
| Update says "could not be accessed on a registry" | Worker nodes can't auth | Use `--with-registry-auth` flag |
| Tasks die at about 95 seconds | Healthcheck failing | Run the healthcheck command manually inside a running container |
| Tasks die immediately | Container itself crashes | `docker service logs ptk_ptk` |
| Service runs but URL gives 404 | Traefik routing issue | Check that the Traefik `Host()` rule in the stack file labels matches the URL |
| Service runs but URL gives 503 | Traefik can't reach container | Check that the container is attached to `cluster-net` |
| `docker service ps` shows old timestamps | Looking at task history, not current | Add `--filter desired-state=running` |
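
For the HTTP 500 row, the retries can be wrapped in a small helper. This is a sketch only: `retry` and the `RETRY_DELAY` variable are hypothetical names, not Docker features, and the 5-second default pause is an arbitrary choice.

```bash
# Sketch: run a command up to N times, pausing between attempts.
# Usage: retry <attempts> <command> [args...]
# Note: plain POSIX sh, so `attempts` and `n` are global variables.
retry() {
    attempts=$1
    shift
    n=1
    while :; do
        "$@" && return 0                      # success: stop retrying
        [ "$n" -ge "$attempts" ] && return 1  # out of attempts: give up
        echo "attempt $n failed; retrying..." >&2
        n=$((n + 1))
        sleep "${RETRY_DELAY:-5}"
    done
}

# Example:
#   retry 3 docker push git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest
```

Because the registry keeps layers that uploaded successfully, a retried push only has to send what is still missing.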
## Verification commands
```bash
# What image is the service currently configured to run?
sudo docker service inspect ptk_ptk --pretty | grep -A1 Image
# What's the healthcheck inside the locally-built image?
docker inspect git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest \
  --format '{{json .Config.Healthcheck}}'
# Show only currently-running tasks (not historical)
sudo docker service ps ptk_ptk --filter desired-state=running
# Tail logs from running tasks
sudo docker service logs -f --tail 50 ptk_ptk
```
## Stack management
```bash
# Show running services
sudo docker service ls
# Tear down stack entirely
sudo docker stack rm ptk
# Redeploy from scratch
sudo docker stack deploy --with-registry-auth -c /root/ptk-stack.yml ptk
# Force restart without changes
sudo docker service update --force ptk_ptk
```
## Healthcheck notes
The healthcheck must use `127.0.0.1`, not `localhost`. BusyBox `wget`
inside Alpine resolves `localhost` to its IPv6 address (`::1`) first,
and nginx is only listening on IPv4. Using the explicit IPv4 address
sidesteps the resolution issue entirely.
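
The healthcheck timing is `interval=30s timeout=3s start-period=5s
retries=3`. The first probe runs 30 seconds after the container starts,
and the container is marked unhealthy after three consecutive failures,
so a healthcheck that never passes takes roughly 90 to 100 seconds
(about 95 in practice) to mark the container unhealthy, at which point
Swarm replaces the task. Tasks dying at the 95-second mark is the
signature of a broken healthcheck.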
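
The timing can be worked out as a lower and upper bound. The sketch below assumes Docker's documented scheduling (first probe `interval` seconds after start, each later probe `interval` seconds after the previous one finishes) and a start period shorter than the interval, so it never masks a failure. `unhealthy_window` is a hypothetical helper and takes whole seconds.

```bash
# Sketch: bounds on time-to-unhealthy for a probe that fails from the start.
# Usage: unhealthy_window <interval> <timeout> <retries>
unhealthy_window() {
    interval=$1
    timeout=$2
    retries=$3
    fastest=$(( interval * retries ))              # every probe fails instantly
    slowest=$(( (interval + timeout) * retries ))  # every probe runs to timeout
    echo "$fastest $slowest"
}

unhealthy_window 30 3 3   # prints "90 99"
```

The observed figure of about 95 seconds sits inside that 90 to 99 second window, with the remainder depending on how quickly each probe fails and how fast Swarm reacts.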
## Cluster context
| Component | Value |
| --------- | ----- |
| Cluster | SOL (4× Raspberry Pi 4, ARM64) |
| Manager | sol0 |
| Registry | `git.bennu.duckdns.org/jshackney/pilot-toolkit-web` |
| Service | `ptk_ptk` (in stack `ptk`) |
| URL | `https://ptk.bennu.duckdns.org` |
| Stack file location on sol0 | `/root/ptk-stack.yml` |