# Pilot Toolkit Web — Deploy Workflow

This document describes the workflow for deploying changes from the
development machine to the SOL cluster. Keep it close.

## Quick Reference: source change to deployed

1. **Edit source** on dev machine (`~/pilot-toolkit-web`)
2. **Commit and push** to git
   ```bash
   git add -A
   git commit -m "Description of change"
   git push
   ```
3. **SSH to sol0** and pull
   ```bash
   ssh sol0
   cd ~/pilot-toolkit-web
   git pull
   ```
4. **Rebuild image**
   ```bash
   docker build -t git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest .
   ```
5. **Push to registry**
   ```bash
   docker push git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest
   ```
   If the push fails with `denied:`, refresh the login with
   `docker login git.bennu.duckdns.org` and push again.
6. **Trigger rolling update**
   ```bash
   sudo docker service update --force --with-registry-auth \
     --image git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest \
     ptk_ptk
   ```
7. **Verify**
   ```bash
   sudo docker service ps ptk_ptk
   ```
   Tasks should reach `Running` and stay there for more than about 2 minutes.
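
Steps 4 to 7 can be chained so that none of them is skipped. The sketch below is a convenience wrapper, not part of the existing tooling: the function names `run` and `deploy` and the `DRY_RUN` variable are hypothetical, while the image and service names are the ones used in the steps above. It assumes you are on sol0 in `~/pilot-toolkit-web` and have already pulled.

```bash
# Image and service names copied from the steps above.
IMAGE="git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest"
SERVICE="ptk_ptk"

# Hypothetical helper: print the command, then execute it unless DRY_RUN=1.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    return 0
  fi
  "$@"
}

# Hypothetical wrapper for steps 4-7; && stops the chain at the first failure.
deploy() {
  run docker build -t "$IMAGE" . &&
  run docker push "$IMAGE" &&
  run sudo docker service update --force --with-registry-auth \
    --image "$IMAGE" "$SERVICE" &&
  run sudo docker service ps "$SERVICE"
}
```

After sourcing these definitions, `DRY_RUN=1 deploy` only prints the four commands, which is a cheap way to check them before running `deploy` for real.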

## Mental model

Three states must stay in sync:

1. **Source**: your code on disk and in git
2. **Image**: built artifact in local Docker (created by `docker build`)
3. **Registry**: uploaded image in Forgejo (uploaded by `docker push`)

Editing source changes #1 only. You must `build` to update #2 and `push`
to update #3. Swarm deploys from #3, not from git.

A real rebuild produces some layers that say `Pushed`. If `docker push`
shows `Layer already exists` for *every* layer, the image is unchanged:
either no rebuild happened or the rebuild picked up no source changes.
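
The `Pushed` versus `Layer already exists` distinction can be checked mechanically. The sketch below counts the layers that were actually uploaded in saved `docker push` output. The function name `count_pushed_layers` and the file name `push.log` are hypothetical, and the pattern assumes the plain one-line-per-layer output that `docker push` prints when its output is piped; adjust it if your Docker version formats the output differently.

```bash
# Hypothetical helper: read `docker push` output on stdin and print how many
# layers were uploaded (lines ending in ": Pushed").
# grep exits non-zero when the count is 0, so `|| true` keeps the helper
# usable in scripts that run under `set -e`.
count_pushed_layers() {
  grep -c ': Pushed$' || true
}

# Usage sketch (not executed here):
#   docker push git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest 2>&1 | tee push.log
#   pushed=$(count_pushed_layers < push.log)
#   [ "$pushed" -gt 0 ] || echo "No new layers: image unchanged since last push"
```

A count of zero does not prove anything is broken; it only says the registry already had every layer, which is the cue to check whether the build really ran against the new source.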

## When things go wrong

| Symptom | Likely cause | Fix |
| ------- | ------------ | --- |
| `docker push` says `denied:` | Stale login session | `docker login git.bennu.duckdns.org` |
| `docker push` says HTTP 500 | Forgejo registry hiccup | Retry 2-3 times; layers already uploaded are skipped, so each attempt makes progress |
| Update says "could not be accessed on a registry" | Worker nodes can't auth | Add `--with-registry-auth` to the update command |
| Tasks die at about 95 seconds | Healthcheck failing | Test healthcheck manually inside running container |
| Tasks die immediately | Container itself crashes | `sudo docker service logs ptk_ptk` |
| Service runs but URL gives 404 | Traefik routing issue | Check that the Traefik labels in `ptk-stack.yml` match the `Host()` rule |
| Service runs but URL gives 503 | Traefik can't reach container | Check that the container is attached to `cluster-net` |
| `docker service ps` shows old timestamps | Looking at task history, not current | Add `--filter desired-state=running` |
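
For the HTTP 500 case, the retries can be scripted. This is a minimal sketch with a hypothetical `retry` function; the attempt count and pause in the usage line are arbitrary choices, not values taken from this setup.

```bash
# Hypothetical helper: retry <attempts> <pause-seconds> <command...>
# Runs the command until it succeeds or the attempts are used up.
# Note: uses plain (global) shell variables to stay POSIX-compatible.
retry() {
  attempts=$1
  pause=$2
  shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempts: $*" >&2
      return 1
    fi
    echo "attempt $n failed, retrying in ${pause}s" >&2
    n=$((n + 1))
    sleep "$pause"
  done
  return 0
}

# Usage sketch (not executed here):
#   retry 3 5 docker push git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest
```

Retrying only helps with transient registry errors. A `denied:` response fails the same way on every attempt and still needs a fresh `docker login`.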

## Verification commands

```bash
# What image is the service currently configured to run?
sudo docker service inspect ptk_ptk --pretty | grep -A1 Image

# What's the healthcheck inside the locally-built image?
docker inspect git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest \
  --format '{{json .Config.Healthcheck}}'

# Show only currently-running tasks (not historical)
sudo docker service ps ptk_ptk --filter desired-state=running

# Tail logs from running tasks
sudo docker service logs -f --tail 50 ptk_ptk
```
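
To confirm that Swarm is running the image that was just pushed, the two digests can be compared. The sketch below is one possible check, under two assumptions: that the service spec pins the image as `name:tag@sha256:...` (which Swarm normally records when the manager can resolve the tag at the registry), and that the first `RepoDigests` entry of the local image belongs to this registry. The function names `digest_of` and `service_matches_local` are hypothetical.

```bash
IMAGE="git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest"
SERVICE="ptk_ptk"

# Hypothetical helper: print the part after "@" (the sha256 digest),
# or an empty line if the reference carries no digest.
digest_of() {
  case "$1" in
    *@*) echo "${1##*@}" ;;
    *)   echo "" ;;
  esac
}

# Hypothetical check: compare the digest pinned in the service spec with the
# digest recorded for the local image after `docker push`.
# Returns 0 on match, 1 on mismatch or missing digest, 2 if an inspect fails.
service_matches_local() {
  svc_ref=$(sudo docker service inspect "$SERVICE" \
    --format '{{.Spec.TaskTemplate.ContainerSpec.Image}}') || return 2
  local_ref=$(docker image inspect "$IMAGE" \
    --format '{{index .RepoDigests 0}}') || return 2
  svc_digest=$(digest_of "$svc_ref")
  [ -n "$svc_digest" ] && [ "$svc_digest" = "$(digest_of "$local_ref")" ]
}
```

Running `service_matches_local && echo "service runs the local image"` on sol0 after step 6 is a quick way to catch a build that was never pushed or an update that never picked up the new image.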

## Stack management

```bash
# Show running services
sudo docker service ls

# Tear down stack entirely
sudo docker stack rm ptk

# Redeploy from scratch
sudo docker stack deploy --with-registry-auth -c /root/ptk-stack.yml ptk

# Force restart without changes
sudo docker service update --force ptk_ptk
```

## Healthcheck notes

The healthcheck must use `127.0.0.1`, not `localhost`. BusyBox `wget`
inside Alpine resolves `localhost` to its IPv6 address (`::1`) first,
and nginx is listening only on IPv4, so the probe fails even though the
site is up. Using the explicit IPv4 address sidesteps the resolution
issue entirely.

The healthcheck timing is `interval=30s timeout=3s start-period=5s retries=3`.
Three consecutive failed checks, 30 seconds apart, are needed before the
container is marked unhealthy, so a failing healthcheck takes about 95
seconds to kill a task. Tasks dying at the 95-second mark are the
signature of a broken healthcheck.
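
The 95-second figure can be sanity-checked with a little arithmetic. The sketch below assumes Docker's documented behaviour that each check starts `interval` seconds after the previous one finishes and that `retries` consecutive failures mark the container unhealthy. It brackets the moment the last failing check ends, between checks that fail instantly and checks that each run into the timeout. The function name `unhealthy_window` is hypothetical.

```bash
# Hypothetical helper: unhealthy_window <interval> <timeout> <retries>
# Prints the earliest and latest second at which the last failing check ends.
unhealthy_window() {
  interval=$1
  timeout=$2
  retries=$3
  earliest=$((retries * interval))              # every check fails instantly
  latest=$((retries * (interval + timeout)))    # every check hits the timeout
  echo "$earliest $latest"
}

unhealthy_window 30 3 3   # prints "90 99"
```

The observed figure of about 95 seconds sits inside that 90 to 99 second window. The 5-second start period does not shift it, because the first check only runs at the 30-second mark, after the start period has ended.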

## Cluster context

| Component | Value |
| --------- | ----- |
| Cluster | SOL (4× Raspberry Pi 4, ARM64) |
| Manager | sol0 |
| Registry | `git.bennu.duckdns.org/jshackney/pilot-toolkit-web` |
| Service | `ptk_ptk` (in stack `ptk`) |
| URL | `https://ptk.bennu.duckdns.org` |
| Stack file location on sol0 | `/root/ptk-stack.yml` |