

Pilot Toolkit Web — Deploy Workflow

This document describes the workflow for deploying changes from the development machine to the SOL cluster. Keep it close.

Quick Reference: source change to deployed

  1. Edit source on dev machine (~/pilot-toolkit-web)
  2. Commit and push to git
    git add -A
    git commit -m "Description of change"
    git push
    
  3. SSH to sol0 and pull
    ssh sol0
    cd ~/pilot-toolkit-web
    git pull
    
  4. Rebuild image
    docker build -t git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest .
    
  5. Push to registry
    docker push git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest
    
    If the push is rejected with denied:, refresh the login with docker login git.bennu.duckdns.org and push again.
    
  6. Trigger rolling update
    sudo docker service update --force --with-registry-auth \
      --image git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest \
      ptk_ptk
    
  7. Verify
    sudo docker service ps ptk_ptk
    
    Tasks should reach Running and stay there for more than ~2 minutes, long enough to outlast the ~95-second healthcheck window.
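
Steps 4 to 7 above can be wrapped in a small script so they always run in the same order. This is a minimal sketch, not part of the repository: the helper names run and deploy and the DRY_RUN variable are hypothetical, and it assumes it is run on sol0 from ~/pilot-toolkit-web after git pull.

```shell
#!/bin/sh
# Sketch of quick-reference steps 4-7. Image and service names are taken from
# this document; run, deploy and DRY_RUN are hypothetical helpers.
IMAGE="git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest"
SERVICE="ptk_ptk"

# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# deploy: build, push, roll out, then list running tasks.
# The && chain stops at the first step that fails.
deploy() {
  run docker build -t "$IMAGE" . &&
  run docker push "$IMAGE" &&
  run sudo docker service update --force --with-registry-auth \
    --image "$IMAGE" "$SERVICE" &&
  run sudo docker service ps "$SERVICE" --filter desired-state=running
}
```

After sourcing the file, DRY_RUN=1 deploy prints the four commands without executing them, and a plain deploy runs them. A denied: error from the push step still needs the manual docker login described in step 5.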

Mental model

Three states must stay in sync:

  1. Source — your code on disk and in git
  2. Image — built artifact in local Docker (created by docker build)
  3. Registry — uploaded image in Forgejo (uploaded by docker push)

Editing source changes #1 only. You must build to update #2 and push to update #3. Swarm deploys from #3, not from git.

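After a rebuild that actually changed something, docker push reports Pushed for at least some layers. If it reports Layer already exists for every layer, the registry already had that image: either no new build happened or the build changed nothing.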
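
One way to check whether the service is running what was last built and pushed is to compare image digests. The sketch below rests on assumptions: the helper names digest_of and service_matches_local_image are hypothetical, the service must have been updated in a way that let the manager pin a digest (for example with --with-registry-auth), and the image is single-platform so that the digest recorded locally after a push is the same one Swarm pins.

```shell
IMAGE_REPO="git.bennu.duckdns.org/jshackney/pilot-toolkit-web"
SERVICE="ptk_ptk"

# digest_of REF: strip everything up to the last '@', leaving "sha256:...".
# Prints nothing if REF carries no digest.
digest_of() {
  case "$1" in
    *@*) printf '%s\n' "${1##*@}" ;;
  esac
}

# service_matches_local_image: succeed if the digest pinned in the service
# spec is among the digests recorded for the local :latest image.
# A freshly built but unpushed image has no recorded digest, so this fails.
service_matches_local_image() {
  pinned="$(sudo docker service inspect "$SERVICE" \
    --format '{{.Spec.TaskTemplate.ContainerSpec.Image}}')"
  want="$(digest_of "$pinned")"
  [ -n "$want" ] || return 1
  docker image inspect "$IMAGE_REPO:latest" \
    --format '{{range .RepoDigests}}{{println .}}{{end}}' | grep -q "$want"
}
```

A failure here means one of the three states is behind: the image was rebuilt but not pushed, or it was pushed but the service was never updated.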

When things go wrong

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| docker push says denied: | Stale login session | docker login git.bennu.duckdns.org |
| docker push says HTTP 500 | Forgejo registry hiccup | Retry 2-3 times; layers already uploaded are skipped, so each retry makes progress |
| Update says "could not be accessed on a registry" | Worker nodes can't authenticate to the registry | Add the --with-registry-auth flag |
| Tasks die at about 95 seconds | Healthcheck failing | Test the healthcheck manually inside a running container |
| Tasks die immediately | Container itself crashes | sudo docker service logs ptk_ptk |
| Service runs but URL gives 404 | Traefik routing issue | Check that the Traefik labels in stack.yml have a Host() rule matching the URL |
| Service runs but URL gives 503 | Traefik can't reach container | Check that the container is on cluster-net |
| docker service ps shows old timestamps | Looking at task history, not current tasks | Add --filter desired-state=running |

Verification commands

# What image is the service currently configured to run?
sudo docker service inspect ptk_ptk --pretty | grep -A1 Image

# What's the healthcheck inside the locally-built image?
docker inspect git.bennu.duckdns.org/jshackney/pilot-toolkit-web:latest \
  --format '{{json .Config.Healthcheck}}'

# Show only currently-running tasks (not historical)
sudo docker service ps ptk_ptk --filter desired-state=running

# Tail logs from running tasks
sudo docker service logs -f --tail 50 ptk_ptk

Stack management

# Show running services
sudo docker service ls

# Tear down stack entirely
sudo docker stack rm ptk

# Redeploy from scratch
sudo docker stack deploy --with-registry-auth -c /root/ptk-stack.yml ptk

# Force restart without changes
sudo docker service update --force ptk_ptk

Healthcheck notes

The healthcheck must use 127.0.0.1, not localhost. BusyBox wget inside Alpine resolves localhost to its IPv6 address (::1) first, and nginx is only listening on IPv4. Using the explicit IPv4 address sidesteps the resolution issue entirely.
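
To see the difference on a live container, a probe against both names can be run from inside it. This is a sketch only: the helper names probe_cmd and compare_probes are hypothetical, and port 80 and the path / are assumptions, so substitute whatever the image's actual healthcheck requests.

```shell
# probe_cmd HOST [PORT]: print a BusyBox-compatible wget probe for HOST.
# Port 80 and path / are assumptions, not taken from the image.
probe_cmd() {
  printf 'wget -q -O /dev/null http://%s:%s/\n' "$1" "${2:-80}"
}

# compare_probes CONTAINER_ID: run the probe inside the container against
# both names and report which one succeeds.
compare_probes() {
  for host in 127.0.0.1 localhost; do
    if sudo docker exec "$1" sh -c "$(probe_cmd "$host")"; then
      echo "$host: ok"
    else
      echo "$host: FAILED"
    fi
  done
}
```

docker exec only reaches containers on the local node, so this has to run on whichever node hosts the task, for example compare_probes "$(sudo docker ps -q --filter name=ptk_ptk | head -n 1)". If the explanation above holds, 127.0.0.1 succeeds and localhost fails.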

The healthcheck timing is interval=30s timeout=3s start-period=5s retries=3. Three consecutive probes must fail, roughly 30 seconds apart, so a failing healthcheck takes about 95 seconds to mark the container unhealthy. Tasks dying around the 95-second mark is the signature of a broken healthcheck.
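
The timing can be worked through as arithmetic. The sketch below assumes Docker's documented scheduling: the first probe runs one interval after the container starts, and each later probe runs one interval after the previous one completes. Since the 5-second start period ends before the first probe, it does not change the result. The helper name unhealthy_after is hypothetical.

```shell
# Values from the healthcheck settings quoted above, in seconds.
interval=30
timeout=3
retries=3

# unhealthy_after PROBE_SECONDS: time until the last of $retries failing
# probes finishes, given how long each failing probe takes.
unhealthy_after() {
  echo $(( retries * (interval + $1) ))
}

echo "fast failures:    $(unhealthy_after 0)s"           # prints 90s
echo "timeout failures: $(unhealthy_after "$timeout")s"  # prints 99s
```

Both cases bracket the 95-second figure: probes that are refused immediately give 90 seconds, and probes that hang for the full timeout give 99 seconds.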

Cluster context

| Component | Value |
| --- | --- |
| Cluster | SOL (4× Raspberry Pi 4, ARM64) |
| Manager | sol0 |
| Registry | git.bennu.duckdns.org/jshackney/pilot-toolkit-web |
| Service | ptk_ptk (in stack ptk) |
| URL | https://ptk.bennu.duckdns.org |
| Stack file location on sol0 | /root/ptk-stack.yml |