Pilot Toolkit Web — Deploy Workflow

This document describes the workflow for deploying changes from the development machine to the SOL cluster. Keep it close.

Quick Reference: source change to deployed

Edit source on dev machine (~/pilot-toolkit-web)

Commit and push to git

git add -A
git commit -m "Description of change"
git push

SSH to sol0 and pull

ssh sol0
cd ~/pilot-toolkit-web
git pull

Rebuild image

docker build -t git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest .

Push to registry
```
docker push git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest
```
If the push fails with `denied:`, refresh the login with `docker login git.bennu.duckdns.org`, then push again.

Trigger rolling update

sudo docker service update --force --with-registry-auth \
  --image git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest \
  ptk_ptk

Verify
```
sudo docker service ps ptk_ptk
```
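Tasks should reach Running and stay there for more than ~2 minutes. A task with a failing healthcheck is killed at about 95 seconds, so surviving past that point means the healthcheck is passing.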
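The build, push, update, and verify steps above can be chained so that a failure stops the sequence before the rolling update. The following is a minimal sketch, assuming it is run on sol0 from ~/pilot-toolkit-web after `git pull`. The `deploy` function and the `DOCKER` and `SUDO` override variables are hypothetical names introduced here for illustration; they are not part of the existing workflow.

```shell
#!/bin/sh
# Hypothetical overrides: set DOCKER=echo and SUDO= to print the commands
# instead of running them.
DOCKER="${DOCKER:-docker}"
SUDO="${SUDO-sudo}"
IMAGE="git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest"
SERVICE="ptk_ptk"

deploy() {
    # Stop at the first failing step so a stale image is never rolled out.
    $DOCKER build -t "$IMAGE" . || return 1
    $DOCKER push "$IMAGE" || return 1
    $SUDO $DOCKER service update --force --with-registry-auth \
        --image "$IMAGE" "$SERVICE" || return 1
    # Show task state; tasks should reach Running and stay there.
    $SUDO $DOCKER service ps "$SERVICE"
}
```

After sourcing the file, calling `deploy` runs the four commands in order. If the push fails with `denied:`, the function returns before touching the service, which leaves room to run `docker login` and try again.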

Mental model

Three states must stay in sync:

1. Source: your code on disk and in git
2. Image: built artifact in local Docker (created by `docker build`)
3. Registry: uploaded image in Forgejo (uploaded by `docker push`)

Editing source changes #1 only. You must build to update #2 and push to update #3. Swarm deploys from #3, not from git.
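One way to check that the service is running what was just pushed is to compare image digests. This is a sketch under assumptions: it assumes the local image carries a single repo digest after `docker push`, and that the service image was pinned to a digest when the update ran. The function names `digest_of`, `same_digest`, and `service_matches_local` are hypothetical helpers, not existing tooling.

```shell
IMAGE="git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest"

# Print the digest part of an image reference (the text after '@'),
# or nothing if the reference has no digest.
digest_of() {
    case "$1" in
        *@*) printf '%s\n' "${1##*@}" ;;
    esac
}

# Succeed only if both references carry the same non-empty digest.
same_digest() {
    a="$(digest_of "$1")"
    b="$(digest_of "$2")"
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Compare the locally pushed image with the image the service is pinned to.
# Assumes the first entry in RepoDigests belongs to this registry.
service_matches_local() {
    local_ref="$(docker image inspect --format '{{index .RepoDigests 0}}' "$IMAGE")"
    service_ref="$(sudo docker service inspect \
        --format '{{.Spec.TaskTemplate.ContainerSpec.Image}}' ptk_ptk)"
    same_digest "$local_ref" "$service_ref"
}
```

If `service_matches_local` fails after a deploy, the likely explanation is that the build or push step was skipped, so the registry and the service still point at an older image.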

A real rebuild produces some layers that say Pushed. If docker push shows Layer already exists for every layer, no new build happened.
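The Pushed versus Layer already exists check can be automated by scanning the output of `docker push`. A minimal sketch, assuming the per-layer status lines look like `<layer id>: Pushed`; the function name `push_uploaded_layers` is hypothetical.

```shell
# Read `docker push` output on stdin.
# Succeeds if at least one layer line ends in "Pushed", fails otherwise.
push_uploaded_layers() {
    grep -Eq ': Pushed[[:space:]]*$'
}
```

A possible use is `docker push "$IMAGE" 2>&1 | push_uploaded_layers || echo "no new layers: did the build run?"`, which prints the warning when every layer already existed in the registry.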

When things go wrong

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| `docker push` says `denied:` | Stale login session | `docker login git.bennu.duckdns.org` |
| `docker push` says HTTP 500 | Forgejo registry hiccup | Retry 2-3 times; layers that already uploaded are skipped, so each retry makes progress |
| Update says "could not be accessed on a registry" | Worker nodes can't auth | Use the `--with-registry-auth` flag |
| Tasks die at about 95 seconds | Healthcheck failing | Test the healthcheck manually inside a running container |
| Tasks die immediately | Container itself crashes | `sudo docker service logs ptk_ptk` |
| Service runs but URL gives 404 | Traefik routing issue | Check that the `Host()` rule in the stack file's Traefik labels matches the URL |
| Service runs but URL gives 503 | Traefik can't reach container | Check that the container is on `cluster-net` |
| `docker service ps` shows old timestamps | Looking at task history, not current tasks | Add `--filter desired-state=running` |

Verification commands

# What image is the service currently configured to run?
sudo docker service inspect ptk_ptk --pretty | grep -A1 Image

# What's the healthcheck inside the locally-built image?
docker inspect git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest \
  --format '{{json .Config.Healthcheck}}'

# Show only currently-running tasks (not historical)
sudo docker service ps ptk_ptk --filter desired-state=running

# Tail logs from running tasks
sudo docker service logs -f --tail 50 ptk_ptk

Stack management

# Show running services
sudo docker service ls

# Tear down stack entirely
sudo docker stack rm ptk

# Redeploy from scratch
sudo docker stack deploy --with-registry-auth -c /root/ptk-stack.yml ptk

# Force restart without changes
sudo docker service update --force ptk_ptk

Healthcheck notes

The healthcheck must use 127.0.0.1, not localhost. BusyBox wget inside Alpine resolves localhost to its IPv6 address (::1) first, and nginx is only listening on IPv4. Using the explicit IPv4 address sidesteps the resolution issue entirely.
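To test the healthcheck by hand, the same kind of request can be run inside a running container. This is a sketch only: the actual healthcheck command is not shown in this document, so the URL (port 80, path `/`) is an assumption, and `probe_container`, `DOCKER`, and `HEALTH_URL` are hypothetical names. It must be run on the node where the task's container is running.

```shell
# Hypothetical override: set DOCKER=echo to print the command instead of running it.
DOCKER="${DOCKER:-docker}"
# Assumed URL; adjust port and path to match the real healthcheck.
HEALTH_URL="${HEALTH_URL:-http://127.0.0.1:80/}"

# $1 is a container ID or name.
# Exit status 0 means wget fetched the URL within 3 seconds.
probe_container() {
    $DOCKER exec "$1" wget -q -O /dev/null -T 3 "$HEALTH_URL"
}
```

A container ID can be found with `docker ps --filter name=ptk_ptk -q`. Setting `HEALTH_URL=http://localhost:80/` and probing again should reproduce the IPv6 failure described above, if that is the cause.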

The healthcheck timing is interval=30s timeout=3s start-period=5s retries=3, which means a failing healthcheck takes about 95 seconds to mark the container unhealthy. Tasks dying at the 95-second mark is the signature of a broken healthcheck.
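The 95-second figure can be sanity-checked with a little arithmetic. As far as Docker's documented behavior goes, the first check runs one interval after the container starts and each later check runs one interval after the previous one finishes, so three consecutive failures take between `retries × interval` (each check fails instantly) and `retries × (interval + timeout)` (each check times out). The 5-second start period ends before the first check, so it does not absorb any failure here. The function name `unhealthy_window` is hypothetical.

```shell
# Print lower and upper bounds, in seconds, on the time a container with an
# always-failing healthcheck takes to be marked unhealthy.
# Arguments: interval timeout retries (all in whole seconds or counts).
unhealthy_window() {
    interval=$1
    timeout=$2
    retries=$3
    echo "$((retries * interval)) $((retries * (interval + timeout)))"
}
```

`unhealthy_window 30 3 3` prints `90 99`. The observed ~95 seconds falls inside that range, with the remainder plausibly being the time Swarm takes to notice the unhealthy state and stop the task.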

Cluster context

| Component | Value |
| --- | --- |
| Cluster | SOL (4× Raspberry Pi 4, ARM64) |
| Manager | sol0 |
| Registry | `git.bennu.duckdns.org/jshackney/pilot-toolkit-web` |
| Service | `ptk_ptk` (in stack `ptk`) |
| URL | `https://ptk.bennu.duckdns.org` |
| Stack file location on sol0 | `/root/ptk-stack.yml` |
