add further documentation for deployment and rename readme to conform to standards
Commit 5032cecc04 (parent 347831b441)

DEPLOY.md (new file, +123 lines)

@@ -0,0 +1,123 @@

# Pilot Toolkit Web: Deploy Workflow

This document describes the workflow for deploying changes from the
development machine to the SOL cluster. Keep it close.

## Quick Reference: source change to deployed

1. **Edit source** on dev machine (`~/pilot-toolkit-web`)
2. **Commit and push** to git
   ```bash
   git add -A
   git commit -m "Description of change"
   git push
   ```
3. **SSH to sol0** and pull
   ```bash
   ssh sol0
   cd ~/pilot-toolkit-web
   git pull
   ```
4. **Rebuild image**
   ```bash
   docker build -t git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest .
   ```
5. **Push to registry**
   ```bash
   docker push git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest
   ```
   If the push fails with `denied:`, refresh the login: `docker login git.bennu.duckdns.org`
6. **Trigger rolling update**
   ```bash
   sudo docker service update --force --with-registry-auth \
     --image git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest \
     ptk_ptk
   ```
7. **Verify**
   ```bash
   sudo docker service ps ptk_ptk
   ```
   Tasks should reach `Running` and stay there for more than ~2 minutes.
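
Steps 4 through 7 can be chained into one shell function so that a failed build or push stops the rollout. This is a minimal sketch, not part of the repository: the function name `deploy_ptk` and the `IMAGE` and `SERVICE` variables are made up for illustration, and it assumes it is run from `~/pilot-toolkit-web` on sol0 after `git pull`.

```bash
# Hypothetical helper: chains steps 4-6, then lists current tasks (step 7).
IMAGE="git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest"
SERVICE="ptk_ptk"

deploy_ptk() {
  # Each step runs only if the previous one succeeded.
  docker build -t "$IMAGE" . &&
    docker push "$IMAGE" &&
    sudo docker service update --force --with-registry-auth \
      --image "$IMAGE" "$SERVICE" &&
    sudo docker service ps "$SERVICE" --filter desired-state=running
}

# Usage, from ~/pilot-toolkit-web on sol0:
#   deploy_ptk
```

Because the steps are joined with `&&`, a `denied:` error from `docker push` stops the chain before the service is touched.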

## Mental model

Three states must stay in sync:

1. **Source**: your code on disk and in git
2. **Image**: built artifact in local Docker (created by `docker build`)
3. **Registry**: uploaded image in Forgejo (uploaded by `docker push`)

Editing source changes #1 only. You must `build` to update #2 and `push`
to update #3. Swarm deploys from #3, not from git.

A real rebuild produces some layers that say `Pushed`. If `docker push`
shows `Layer already exists` for *every* layer, the image is unchanged:
either no rebuild happened or the build picked up no source changes.
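
The `Pushed` versus `Layer already exists` check can be automated by scanning the push output. The sketch below assumes the output has been saved or piped in as plain text, one layer status per line. The function name `push_uploaded_layers` and the log path are hypothetical.

```bash
# Hypothetical helper: reads `docker push` output on stdin and succeeds
# (exit 0) only if at least one layer line reports "Pushed".
push_uploaded_layers() {
  grep -q ': Pushed'
}

# Usage:
#   docker push git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest | tee /tmp/push.log
#   push_uploaded_layers < /tmp/push.log || echo "no new layers: image unchanged"
```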

## When things go wrong

| Symptom | Likely cause | Fix |
| ------- | ------------ | --- |
| `docker push` says `denied:` | Stale login session | `docker login git.bennu.duckdns.org` |
| `docker push` says HTTP 500 | Forgejo registry hiccup | Retry 2-3 times; uploaded layers are not re-sent |
| Update says "could not be accessed on a registry" | Worker nodes can't authenticate | Add `--with-registry-auth` to the update command |
| Tasks die at about 95 seconds | Healthcheck failing | Run the healthcheck manually inside a running container |
| Tasks die immediately | Container itself crashes | `sudo docker service logs ptk_ptk` |
| Service runs but URL gives 404 | Traefik routing issue | Check Traefik `Host()` labels in `/root/ptk-stack.yml` |
| Service runs but URL gives 503 | Traefik can't reach container | Check the container is attached to `cluster-net` |
| `docker service ps` shows old timestamps | Looking at task history, not current | Add `--filter desired-state=running` |
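
For the HTTP 500 case, the retries can be wrapped in a small loop. This is a sketch under assumptions: `retry` and `RETRY_DELAY` are hypothetical names, and the five-second default pause is arbitrary.

```bash
# Hypothetical helper: run a command up to N times, pausing between attempts.
retry() {
  local attempts=$1
  shift
  local n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "${RETRY_DELAY:-5}"   # pause before the next attempt
  done
}

# Usage:
#   retry 3 docker push git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest
```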

## Verification commands

```bash
# What image is the service currently configured to run?
sudo docker service inspect ptk_ptk --pretty | grep -A1 Image

# What's the healthcheck inside the locally-built image?
docker inspect git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest \
  --format '{{json .Config.Healthcheck}}'

# Show only currently-running tasks (not historical)
sudo docker service ps ptk_ptk --filter desired-state=running

# Tail logs from running tasks
sudo docker service logs -f --tail 50 ptk_ptk
```
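
When scripting the verify step, it can help to reduce the task listing to a single number. The sketch below counts tasks whose current state starts with `Running`. The function name `running_tasks` is hypothetical, and the expected count depends on the replica setting in the stack file, which this document does not state.

```bash
# Hypothetical helper: count tasks of a service whose current state is "Running".
running_tasks() {
  sudo docker service ps "$1" \
    --filter desired-state=running \
    --format '{{.CurrentState}}' | grep -c '^Running'
}

# Usage:
#   running_tasks ptk_ptk
```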

## Stack management

```bash
# Show running services
sudo docker service ls

# Tear down stack entirely
sudo docker stack rm ptk

# Redeploy from scratch
sudo docker stack deploy --with-registry-auth -c /root/ptk-stack.yml ptk

# Force restart without changes
sudo docker service update --force ptk_ptk
```

## Healthcheck notes

The healthcheck must use `127.0.0.1`, not `localhost`. BusyBox `wget`
inside Alpine resolves `localhost` to its IPv6 address (`::1`) first,
and nginx is only listening on IPv4. Using the explicit IPv4 address
sidesteps the resolution issue entirely.
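
To test this by hand, the same kind of request can be run inside a live container. This is a sketch under assumptions: `probe_container` is a hypothetical name, the default URL (port 80, path `/`) is a guess that should be replaced with whatever the image's healthcheck actually requests, and it must run on the node that hosts the task.

```bash
# Hypothetical helper: issue a wget request from inside a running ptk_ptk
# container on the current node, mirroring what a healthcheck would do.
probe_container() {
  local url=${1:-http://127.0.0.1/}   # port and path are assumptions
  local cid
  cid=$(sudo docker ps --quiet --filter name=ptk_ptk | head -n 1)
  [ -n "$cid" ] || { echo "no ptk_ptk container on this node" >&2; return 2; }
  sudo docker exec "$cid" wget -q -O /dev/null "$url"
}

# Usage:
#   probe_container                     # explicit IPv4 address
#   probe_container http://localhost/   # compare against name resolution
```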

The healthcheck timing is `interval=30s timeout=3s start-period=5s retries=3`.
The container is marked unhealthy only after three consecutive failed checks,
roughly 30 seconds apart, so a failing healthcheck takes about 95 seconds to
take effect. Tasks dying at the 95-second mark is the signature of a broken
healthcheck.
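
The figure can be sanity-checked with a little arithmetic. The sketch below assumes the first check runs one interval after container start and each later check runs one interval after the previous check completes. The function name `unhealthy_bounds` is hypothetical. The 5-second start period does not enter the calculation because it ends before the first check at 30 seconds.

```bash
# Hypothetical helper: rough lower and upper bounds, in seconds, for how long
# a container with an always-failing healthcheck takes to be marked unhealthy.
unhealthy_bounds() {
  local interval=$1 timeout=$2 retries=$3
  # Fastest case: every check fails immediately.
  local fastest=$(( retries * interval ))
  # Slowest case: every check hangs until the timeout expires.
  local slowest=$(( retries * (interval + timeout) ))
  echo "$fastest $slowest"
}

unhealthy_bounds 30 3 3   # prints "90 99"
```

The observed 95 seconds falls inside that 90 to 99 second window, with the remainder plausibly taken up by Swarm noticing the state change and stopping the task.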

## Cluster context

| Component | Value |
| --------- | ----- |
| Cluster | SOL (4× Raspberry Pi 4, ARM64) |
| Manager | sol0 |
| Registry | `git.bennu.duckdns.org/jshackney/pilot-toolkit-web` |
| Service | `ptk_ptk` (in stack `ptk`) |
| URL | `https://ptk.bennu.duckdns.org` |
| Stack file location on sol0 | `/root/ptk-stack.yml` |