add further documentation for deployment and rename readme to conform to standards
parent 347831b441
commit 5032cecc04
123
DEPLOY.md
Normal file
@ -0,0 +1,123 @@
# Pilot Toolkit Web — Deploy Workflow

This document describes the workflow for deploying changes from the
development machine to the SOL cluster. Keep it close.

## Quick Reference: source change to deployed

1. **Edit source** on dev machine (`~/pilot-toolkit-web`)
2. **Commit and push** to git
   ```bash
   git add -A
   git commit -m "Description of change"
   git push
   ```
3. **SSH to sol0** and pull
   ```bash
   ssh sol0
   cd ~/pilot-toolkit-web
   git pull
   ```
4. **Rebuild image**
   ```bash
   docker build -t git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest .
   ```
5. **Push to registry**
   ```bash
   docker push git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest
   ```
   If it says `denied:`, refresh login: `docker login git.bennu.duckdns.org`
6. **Trigger rolling update**
   ```bash
   sudo docker service update --force --with-registry-auth \
     --image git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest \
     ptk_ptk
   ```
7. **Verify**
   ```bash
   sudo docker service ps ptk_ptk
   ```
Tasks should reach `Running` and stay there past ~2 minutes.
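
Steps 4 through 7 can be chained in a small script. The sketch below is not part of the repository: the `run` and `deploy` function names and the `DRY_RUN` variable are hypothetical, and it assumes it is run on sol0 from `~/pilot-toolkit-web` after `git pull`. By default it only prints the commands it would run.

```bash
#!/usr/bin/env bash
# Sketch: chain build, push, rolling update and verify (steps 4-7 above).
# Assumption: run on sol0 from ~/pilot-toolkit-web after `git pull`.
# DRY_RUN defaults to 1 (print only); set DRY_RUN=0 to actually execute.
set -euo pipefail

IMAGE="git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest"
SERVICE="ptk_ptk"

run() {
  # Print the command; execute it only when DRY_RUN=0.
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

deploy() {
  run docker build -t "$IMAGE" .
  run docker push "$IMAGE"
  run sudo docker service update --force --with-registry-auth --image "$IMAGE" "$SERVICE"
  run sudo docker service ps "$SERVICE" --filter desired-state=running
}

deploy
```

Because of `set -e`, a failed build or push stops the script before the service update, so a broken image is never rolled out by this path.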

## Mental model

Three states must stay in sync:

1. **Source** — your code on disk and in git
2. **Image** — built artifact in local Docker (created by `docker build`)
3. **Registry** — uploaded image in Forgejo (uploaded by `docker push`)

Editing source changes #1 only. You must `build` to update #2 and `push`
to update #3. Swarm deploys from #3, not from git.

A real rebuild produces some layers that say `Pushed`. If `docker push`
shows `Layer already exists` for *every* layer, no new build happened.
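
One way to check that the image and the running service agree is to compare digests. This is a sketch under assumptions: the local image has been pushed at least once (so `RepoDigests` is populated), and Swarm was able to pin the service image to a digest when the update ran. The `digest_of` helper and the `--run` switch are hypothetical names introduced here.

```bash
#!/usr/bin/env bash
# Sketch: compare the digest of the locally built image with the one the
# service is configured to run. Pass --run on sol0 to query Docker.
set -euo pipefail

IMAGE="git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest"
SERVICE="ptk_ptk"

digest_of() {
  # Print the sha256:... part of an image reference, or nothing if absent.
  case "$1" in
    *@sha256:*) printf 'sha256:%s\n' "${1##*@sha256:}" ;;
    *) ;;
  esac
}

if [ "${1:-}" = "--run" ]; then
  local_ref="$(docker image inspect "$IMAGE" --format '{{index .RepoDigests 0}}')"
  service_ref="$(sudo docker service inspect "$SERVICE" \
    --format '{{.Spec.TaskTemplate.ContainerSpec.Image}}')"
  local_digest="$(digest_of "$local_ref")"
  service_digest="$(digest_of "$service_ref")"
  if [ -n "$local_digest" ] && [ "$local_digest" = "$service_digest" ]; then
    echo "in sync: $local_digest"
  else
    echo "out of sync: local=$local_digest service=$service_digest"
  fi
fi
```

If the two digests differ, the service is most likely still running an older push, and step 6 has not picked up the new image yet.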

## When things go wrong

| Symptom | Likely cause | Fix |
| ------------------------------------------------- | ------------------------------------ | ---------------------------------------------------- |
| `docker push` says `denied:` | Stale login session | `docker login git.bennu.duckdns.org` |
| `docker push` says HTTP 500 | Forgejo registry hiccup | Retry 2-3 times; uploaded layers are not re-sent |
| Update says "could not be accessed on a registry" | Worker nodes can't auth | Use `--with-registry-auth` flag |
| Tasks die at about 95 seconds | Healthcheck failing | Test healthcheck manually inside running container |
| Tasks die immediately | Container itself crashes | `docker service logs ptk_ptk` |
| Service runs but URL gives 404 | Traefik routing issue | Check Traefik labels in stack.yml match Host() |
| Service runs but URL gives 503 | Traefik can't reach container | Check container is on cluster-net |
| `docker service ps` shows old timestamps | Looking at task history, not current | Add `--filter desired-state=running` |
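
The "retry 2-3 times" advice for a flaky push can be wrapped in a small helper. This is a minimal sketch: the `retry` function and the `RETRY_DELAY` variable are hypothetical names, not something Docker provides.

```bash
#!/usr/bin/env bash
# Sketch: rerun a command until it succeeds or the attempts run out.
set -euo pipefail

retry() {
  # Usage: retry <max_attempts> <command> [args...]
  local max="$1"
  shift
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "${RETRY_DELAY:-5}"   # pause between attempts, in seconds
  done
}

# Example (not executed here):
#   retry 3 docker push git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest
```

Layers that reached the registry on an earlier attempt show `Layer already exists` on the next one, so each retry only has to upload what is still missing.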

## Verification commands

```bash
# What image is the service currently configured to run?
sudo docker service inspect ptk_ptk --pretty | grep -A1 Image

# What's the healthcheck inside the locally-built image?
docker inspect git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest \
  --format '{{json .Config.Healthcheck}}'

# Show only currently-running tasks (not historical)
sudo docker service ps ptk_ptk --filter desired-state=running

# Tail logs from running tasks
sudo docker service logs -f --tail 50 ptk_ptk
```

## Stack management

```bash
# Show running services
sudo docker service ls

# Tear down stack entirely
sudo docker stack rm ptk

# Redeploy from scratch
sudo docker stack deploy --with-registry-auth -c /root/ptk-stack.yml ptk

# Force restart without changes
sudo docker service update --force ptk_ptk
```

## Healthcheck notes

The healthcheck must use `127.0.0.1`, not `localhost`. BusyBox `wget`
inside Alpine resolves `localhost` to its IPv6 address (`::1`) first,
and nginx is only listening on IPv4. Using the explicit IPv4 address
sidesteps the resolution issue entirely.
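
To test the healthcheck by hand, the same probe can be run inside the running container. The sketch below assumes nginx listens on port 80 and serves `/` (neither is stated above), and that it is run on the node that hosts the task. The `probe_url` helper, the `PORT` variable and the `--run` switch are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: probe the app from inside its own container with BusyBox wget,
# once via 127.0.0.1 and once via localhost, to compare the two.
# Assumptions: port 80, path "/", run on the node hosting the task.
set -euo pipefail

probe_url() {
  # $1 = host name or address to probe
  printf 'http://%s:%s/\n' "$1" "${PORT:-80}"
}

if [ "${1:-}" = "--run" ]; then
  # Swarm task containers are named <service>.<slot>.<task-id>.
  cid="$(sudo docker ps -q --filter name=ptk_ptk | head -n 1)"
  for host in 127.0.0.1 localhost; do
    if sudo docker exec "$cid" wget -q --spider "$(probe_url "$host")"; then
      echo "$host: OK"
    else
      echo "$host: FAILED"
    fi
  done
fi
```

If `127.0.0.1` succeeds while `localhost` fails, the IPv6 resolution issue described above is the likely cause.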

The healthcheck timing is `interval=30s timeout=3s start-period=5s
retries=3`, which means a failing healthcheck takes about 95 seconds
to mark the container unhealthy. Tasks dying at the 95-second mark is
the signature of a broken healthcheck.
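
The figure of about 95 seconds can be reproduced with a rough estimate: the start period plus one interval per retry. This is an approximation, assuming each failed probe returns almost immediately so the 3 second timeout does not add to the total. It is not a description of Docker's exact scheduling.

```bash
#!/usr/bin/env bash
# Rough estimate of time until the container is marked unhealthy.
# Assumption: failed probes return immediately, so timeout is not counted.
interval=30      # seconds between probes
start_period=5   # grace period after container start
retries=3        # consecutive failures needed

unhealthy_after=$(( start_period + retries * interval ))
echo "${unhealthy_after}s"
```

Changing `interval` or `retries` in the image's healthcheck shifts this mark, so the number to watch for in `docker service ps` changes with it.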

## Cluster context

| Component | Value |
| --------- | ----- |
| Cluster | SOL (4× Raspberry Pi 4, ARM64) |
| Manager | sol0 |
| Registry | `git.bennu.duckdns.org/jshackney/pilot-toolkit-web` |
| Service | `ptk_ptk` (in stack `ptk`) |
| URL | `https://ptk.bennu.duckdns.org` |
| Stack file location on sol0 | `/root/ptk-stack.yml` |