Pilot Toolkit Web — Deploy Workflow
This document describes the workflow for deploying changes from the development machine to the SOL cluster. Keep it close.
Quick Reference: source change to deployed
1. Edit source on dev machine (~/pilot-toolkit-web)
2. Commit and push to git

       git add -A
       git commit -m "Description of change"
       git push

3. SSH to sol0 and pull

       ssh sol0
       cd ~/pilot-toolkit-web
       git pull

4. Rebuild image

       docker build -t git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest .

5. Push to registry

       docker push git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest

   If it says denied:, refresh login:

       docker login git.bennu.duckdns.org

6. Trigger rolling update

       sudo docker service update --force --with-registry-auth \
         --image git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest \
         ptk_ptk

7. Verify

       sudo docker service ps ptk_ptk

   Tasks should reach Running and stay there past ~2 minutes.
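The manager-side steps above (pull, build, push, update, verify) can be collected into one script. The following is a minimal sketch, not the project's actual tooling: the image name, service name and repository path come from this document, while the `run` and `deploy` helpers and the `DRY_RUN` switch are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Sketch of the steps that run on sol0. IMAGE, SERVICE and REPO_DIR are
# taken from this document; DRY_RUN and the helper names are assumptions.
IMAGE="git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest"
SERVICE="ptk_ptk"
REPO_DIR="$HOME/pilot-toolkit-web"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Stop at the first failing step so a broken build is never deployed.
deploy() {
    run git -C "$REPO_DIR" pull &&
    run docker build -t "$IMAGE" "$REPO_DIR" &&
    run docker push "$IMAGE" &&
    run sudo docker service update --force --with-registry-auth \
        --image "$IMAGE" "$SERVICE" &&
    run sudo docker service ps "$SERVICE" --filter desired-state=running
}
```

Calling `DRY_RUN=1 deploy` prints the five commands without running them, which is a cheap way to review the sequence before calling `deploy` for real.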
Mental model
Three states must stay in sync:
1. Source: your code on disk and in git
2. Image: built artifact in local Docker (created by docker build)
3. Registry: uploaded image in Forgejo (uploaded by docker push)
Editing source changes only the source (#1). You must build to update
the image (#2) and push to update the registry (#3). Swarm deploys from
the registry (#3), not from git.
A real rebuild produces some layers that say Pushed. If docker push
shows Layer already exists for every layer, no new build happened.
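The Pushed versus Layer already exists distinction can be checked mechanically. This is a small sketch, assuming the per-layer status lines that docker push prints when its output is piped; the function name `has_new_layers` is hypothetical.

```shell
# Read `docker push` output on stdin and succeed only if at least one
# layer line ends in "Pushed". The function name is an assumption.
has_new_layers() {
    grep -Eq ': Pushed[[:space:]]*$'
}
```

One possible use is `docker push git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest | has_new_layers || echo "no new layers; did the build run?"`, which prints the warning when every layer was already in the registry.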
When things go wrong
| Symptom | Likely cause | Fix |
|---|---|---|
| docker push says denied: | Stale login session | docker login git.bennu.duckdns.org |
| docker push says HTTP 500 | Forgejo registry hiccup | Retry 2-3 times; layers already uploaded are skipped, so each retry makes progress |
| Update says "could not be accessed on a registry" | Worker nodes can't auth | Use --with-registry-auth flag |
| Tasks die at exactly 95 seconds | Healthcheck failing | Test healthcheck manually inside running container |
| Tasks die immediately | Container itself crashes | docker service logs ptk_ptk |
| Service runs but URL gives 404 | Traefik routing issue | Check Traefik labels in stack.yml match Host() |
| Service runs but URL gives 503 | Traefik can't reach container | Check container is on cluster-net |
| docker service ps shows old timestamps | Looking at task history, not current | Add --filter desired-state=running |
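For the HTTP 500 case, the retries can be automated with a small wrapper. This is a generic sketch rather than anything Docker provides: `retry` and the `RETRY_DELAY` variable are hypothetical names, and the wrapper retries on any non-zero exit, not only on HTTP 500.

```shell
# retry MAX CMD...: run CMD up to MAX times, pausing between attempts.
# RETRY_DELAY (seconds, default 5) is a hypothetical knob.
# Note: POSIX sh has no local variables, so max and attempt are global.
retry() {
    max="$1"; shift
    attempt=1
    while :; do
        "$@" && return 0
        [ "$attempt" -ge "$max" ] && return 1
        echo "attempt $attempt of $max failed; retrying" >&2
        sleep "${RETRY_DELAY:-5}"
        attempt=$((attempt + 1))
    done
}
```

For example, `retry 3 docker push git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest` gives the push three chances before giving up.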
Verification commands
# What image is the service currently configured to run?
sudo docker service inspect ptk_ptk --pretty | grep -A1 Image
# What's the healthcheck inside the locally-built image?
docker inspect git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest \
--format '{{json .Config.Healthcheck}}'
# Show only currently-running tasks (not historical)
sudo docker service ps ptk_ptk --filter desired-state=running
# Tail logs from running tasks
sudo docker service logs -f --tail 50 ptk_ptk
Stack management
# Show running services
sudo docker service ls
# Tear down stack entirely
sudo docker stack rm ptk
# Redeploy from scratch
sudo docker stack deploy --with-registry-auth -c /root/ptk-stack.yml ptk
# Force restart without changes
sudo docker service update --force ptk_ptk
Healthcheck notes
The healthcheck must use 127.0.0.1, not localhost. BusyBox wget
inside Alpine resolves localhost to its IPv6 address (::1) first,
and nginx is only listening on IPv4. Using the explicit IPv4 address
sidesteps the resolution issue entirely.
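A quick guard against this mistake is to scan the healthcheck definition for the word localhost. The sketch below assumes the definition arrives on stdin, for example the JSON printed by the docker inspect command in the verification section; the function name `check_healthcheck_host` is hypothetical, and the check is a plain text match rather than a real JSON parse.

```shell
# Read a healthcheck definition on stdin and fail if it mentions
# localhost. The function name is an assumption for illustration.
check_healthcheck_host() {
    if grep -q 'localhost'; then
        echo "healthcheck uses localhost; use 127.0.0.1 instead" >&2
        return 1
    fi
}
```

It could be fed with `docker inspect git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest --format '{{json .Config.Healthcheck}}' | check_healthcheck_host` after each build.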
The healthcheck timing is interval=30s timeout=3s start-period=5s retries=3.
A healthcheck that never passes therefore takes about 95 seconds to mark
the container unhealthy: roughly three failed checks 30 seconds apart,
plus the 5-second start period. Tasks dying at the 95-second mark are
the signature of a broken healthcheck.
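The 95-second figure can be reproduced with simple arithmetic. This is only a rough estimate, assuming the failure time is the start period plus one interval per retry; Docker's actual scheduling of the first probe and the per-check timeout can shift the result by a few seconds. The function name `unhealthy_after` is hypothetical.

```shell
# Rough time in seconds until a never-passing healthcheck marks the
# container unhealthy: start period plus one interval per retry.
# Arguments: interval, start period, retries (all whole numbers).
unhealthy_after() {
    interval="$1"; start_period="$2"; retries="$3"
    echo $((start_period + retries * interval))
}

unhealthy_after 30 5 3   # prints 95
```

Changing the interval or retry count in the image moves this signature accordingly, so the number to watch for is whatever this estimate gives for the current settings.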
Cluster context
| Component | Value |
|---|---|
| Cluster | SOL (4× Raspberry Pi 4, ARM64) |
| Manager | sol0 |
| Registry | git.bennu.duckdns.org/jshackney/pilot-toolkit-web |
| Service | ptk_ptk (in stack ptk) |
| URL | https://ptk.bennu.duckdns.org |
| Stack file location on sol0 | /root/ptk-stack.yml |