Kubernetes · k3s · ArgoCD · DevOps · self-hosting · image-updater · GitOps · lessons learned · afrotomation startup · indie hacking

Three Days In: The Post-Migration Trenches of a Self-Hosted k3s Cluster

Three days after standing up a k3s cluster for 47 apps, the bugs started arriving: SSE reconnect storms, image-updater silent no-ops, billing limits on every free tier, and a Hashnode cover image about to immortalize a 'Cluster Not Found' error. Notes from the trenches.

Afrotomation · April 26, 2026 · 9 min read

I shipped two retrospectives in the last week. The first was moving 47 apps off Vercel onto a self-hosted Coolify cluster. The second was tearing that down nine days later and rebuilding on k3s. Both ended with the line every infra retrospective ends with: "and now everything works."

Three days later, here are the things that did not work.

This is the first post in the Afrotomation Startup series — the ongoing, unfiltered version of building this in public. The migration retrospectives were the highlight reel. This is the trench warfare.

1. The image-updater silent no-op

I was proud of the GitOps loop:

git push  →  GHA build  →  GHCR :latest  →  image-updater bumps tag in Git  →  ArgoCD syncs  →  pod rolls

Three days post-migration, I sat down to verify a PR was actually live in the cluster. It wasn't. The pod was running an image from two days earlier — when its node had last been rebooted and it had pulled :latest for the first time.

I checked image-updater. It had never written a single commit to afrotomation-infra. Across 47 apps. Since day one.

The culprit was a two-part configuration bug that I want to write about plainly because both halves are easy to miss:

Part one: wrong update strategy. Every Application annotation said:

argocd-image-updater.argoproj.io/app.update-strategy: latest

The "latest" strategy (newer docs call it newest-build) ranks tags by ctime and picks the most recent. If the only tag in your registry is :latest, that strategy sees one tag, finds nothing newer than itself, and silently no-ops. Forever.

The correct strategy for a rolling tag like :latest is digest. It re-fetches the manifest digest of :latest on every poll cycle and writes back when it changes.

Part two: the tag has to be in the image-list. With update-strategy: digest, the Application must explicitly tell image-updater which tag to inspect. Our annotation was:

argocd-image-updater.argoproj.io/image-list: app=ghcr.io/codenificient/<repo>

Notice the missing :latest. With no tag specified, digest strategy has nothing to inspect — silent no-op. Same outcome.

The fix is two lines per Application:

-    argocd-image-updater.argoproj.io/image-list: app=ghcr.io/codenificient/<repo>
-    argocd-image-updater.argoproj.io/app.update-strategy: latest
+    argocd-image-updater.argoproj.io/image-list: app=ghcr.io/codenificient/<repo>:latest
+    argocd-image-updater.argoproj.io/app.update-strategy: digest
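For reference, here's roughly what one Application's annotation block looks like after the fix. A minimal sketch, not a copy of my manifests: the write-back annotations shown are the standard image-updater ones for committing to Git, and the app name and branch are placeholders.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-app          # placeholder
  namespace: argocd
  annotations:
    # digest strategy needs the tag spelled out in the image-list
    argocd-image-updater.argoproj.io/image-list: app=ghcr.io/codenificient/<repo>:latest
    argocd-image-updater.argoproj.io/app.update-strategy: digest
    # commit the new digest to the infra repo instead of patching the live Application
    argocd-image-updater.argoproj.io/write-back-method: git
    argocd-image-updater.argoproj.io/git-branch: main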

Plus a Helm values override on the image-updater chart itself:

config:
  gitCommitUser: argocd-image-updater
  gitCommitMail: [email protected]

Without those, git's "empty ident" guard rejects every commit before it leaves the pod. Another silent failure mode.
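To confirm commits are actually flowing after a change like this, tail the updater's logs; the namespace and deployment name here assume the chart defaults:

kubectl -n argocd logs deploy/argocd-image-updater --tail=100 | grep -iE 'commit|update|error'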

Lesson: every silent no-op in a distributed pipeline is a configuration problem hiding behind an empty log line. When something "should be working" and isn't, look for the place where the system shrugs instead of erroring.

2. The SSE reconnect storm in ClickRise

I opened the dashboard of ClickRise and looked at the network tab on a fully idle page. Dozens of requests. Per minute. Doing nothing useful.

Two anti-patterns compounding:

useRealtimeUpdates declared all 10 callback props in its useEffect deps array:

useEffect(() => {
  const eventSource = new EventSource(url)
  // ... onmessage handlers using onTaskCreated, onCommentCreated, etc.
  return () => eventSource.close()
}, [session?.user?.id, enabled, projectId, taskId,
    onEvent, onTaskCreated, onTaskUpdated, /* …7 more */])

If the parent passed inline arrow functions (which it did — every time), each render produced 10 fresh function references. Effect re-fired. SSE closed. New SSE opened. That's a fresh /api/realtime/stream HTTP request on every parent render.

AppContext's value object was constructed fresh on every render too — new object identity → every consumer's useAppContext() re-rendered → every effect that depended on setCurrentOrganization (whose own identity churned because workspaces was a fresh array each render) re-fired.

Compound effect: a navigation event would cascade into ~30 unnecessary HTTP requests in under a second.

Fix is the textbook ref pattern:

const handlersRef = useRef({ onEvent, onTaskCreated, /* … */ })
useEffect(() => {
  handlersRef.current = { onEvent, onTaskCreated, /* … */ }
}, [onEvent, onTaskCreated, /* … */])

useEffect(() => {
  const eventSource = new EventSource(url)
  eventSource.onmessage = (e) => handlersRef.current.onEvent(parse(e.data))
  return () => eventSource.close()
}, [session?.user?.id, enabled, projectId, taskId])
// ↑ no callbacks in deps — see ref pattern above

Plus useMemo on the AppContext value object.
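The AppContext side of the fix, sketched with names reconstructed from the description above (Organization and setCurrentOrgState are assumptions, not the actual source):

import { useCallback, useMemo } from 'react'

// workspaces needs stable identity too (it was a fresh array each render),
// so memoize it where it's derived, e.g.
// const workspaces = useMemo(() => orgs.flatMap(o => o.workspaces), [orgs])

const setCurrentOrganization = useCallback((org: Organization) => {
  setCurrentOrgState(org)
}, [])

// stable value object: consumers re-render only when the contents change
const value = useMemo(
  () => ({ workspaces, currentOrganization, setCurrentOrganization }),
  [workspaces, currentOrganization, setCurrentOrganization]
)

return <AppContext.Provider value={value}>{children}</AppContext.Provider>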

Lesson: SSE and other long-lived connections want connection identity in the deps array, not callback identity. Mixing them makes the connection-opening effect re-fire on every render.

3. URL params getting wiped on every navigation

Same project, different bug. After the auth migration, every sidebar click on the dashboard wiped the org and workspace from the URL. Child components reading searchParams.get('workspace') got null and fetched empty data.

The commit history showed someone (an earlier Claude Opus version, in fact — I'm not above admitting it) had fixed this back in October 2025 with a needsWorkspace flag on the sidebar nav items + a URL-sync routine in AppContextProvider.initializeSelections. The Better Auth migration in early 2026 rewrote app-context.tsx and the URL-sync routine didn't make the cut.

The fix: a tiny useContextHref() hook that takes any href and prepends ?organization=…&workspace=…, plus a corresponding router.replace() in initializeSelections so the URL has the params from the very first frame. Three files, twenty lines, two hours of staring at React DevTools.
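A minimal sketch of the hook, assuming Next.js App Router (the real implementation may differ):

import { useCallback } from 'react'
import { useSearchParams } from 'next/navigation'

// prepends the current org/workspace params to any href so sidebar
// navigation never drops them
export function useContextHref() {
  const searchParams = useSearchParams()
  return useCallback((href: string) => {
    const params = new URLSearchParams()
    const org = searchParams.get('organization')
    const ws = searchParams.get('workspace')
    if (org) params.set('organization', org)
    if (ws) params.set('workspace', ws)
    const qs = params.toString()
    return qs ? `${href}?${qs}` : href
  }, [searchParams])
}

Sidebar items then render with href={contextHref('/tasks')} (where const contextHref = useContextHref()), and the router.replace() in initializeSelections seeds the params from the very first frame.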

Lesson: when you do a major dependency migration, audit every file that the migration touched for old-but-correct code that the new pattern accidentally deleted. URL state is invisible until it isn't.

4. The free-tier ceilings I keep hitting

Half of the time I save by self-hosting infrastructure, I lose again to free-tier ceilings on the services I don't yet self-host:

  • GitHub Actions. Three days after the k3s cutover, every workflow run started failing with "The job was not started because recent account payments have failed or your spending limit needs to be increased". I'm building 47 Docker images per merge wave; I'm out of free minutes. I'm now standing up self-hosted GHA runners on the same Contabo VPS 50 that runs the k3s workloads — the box has 16 vCPUs and 64 GB of RAM, and the workloads use barely a quarter of that on a quiet day, so allocating 6 vCPUs and 16 GB of RAM to a runner pool costs me nothing in headroom and replaces a billing line item entirely (see the workflow sketch after this list).
  • Microlink. The portfolio page renders live screenshots of every project via Microlink. At 47 cards, the homepage burns through Microlink's hourly quota in one viral tweet. I've already moved to direct-Microlink-URLs-with-unoptimized so the browser hits Microlink once per visitor IP rather than my server hitting it 47× per request, but the quota math still doesn't scale.
  • Backblaze B2. The pgBackRest backups for CloudNativePG work great. The first 10 GB/month are free. I'm under that today. I won't be next quarter.
  • Hashnode's free publication tier. Three series, twelve posts, no custom analytics — fine for now. Will need to revisit when the series count grows.
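The workflow-side change for the runner migration is small. A sketch, with the runner labels assumed (they depend on how the runners register):

# .github/workflows/build.yml (sketch; labels are assumptions)
jobs:
  build:
    runs-on: [self-hosted, linux, x64]   # was: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push
        run: |
          docker build -t ghcr.io/codenificient/<repo>:latest .
          docker push ghcr.io/codenificient/<repo>:latest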

I don't begrudge any of these companies their free tiers — they're how I built the runway for Afrotomation in the first place. But the consistent pattern is: every "free for hobbyists" tier hits a wall right around the point where my project starts looking like a real product. That's correct from their business angle and correct from mine. The growing-up part is recognizing the wall before you hit it and deciding which ones to self-host and which ones to pay for.

This week's plan: self-host the GHA runners, keep paying Cloudflare and Hashnode (they're earning it), keep B2 free until I exceed it, and take Microlink off the critical path entirely by caching screenshots in Cloudinary on a daily cron.

5. The two-day Tailscale-on-Mac outage

Mid-Coolify-to-k3s migration, my Mac's Tailscale CLI just stopped working with Tailscale.CLIError error 1. I lost the ability to kubectl into the cluster from my laptop. My first instinct was to swap to Headscale. The Claude session I was working with talked me out of it: "don't rat-hole on this tonight, you have 47 apps to migrate." It was right.

Workaround that's still in place three days later: I keep a ~/.kube/afrotomation-prod-public.yaml kubeconfig with public IPs in the server: URLs (instead of tailnet IPs). When Tailscale on Mac dies, I lose mTLS-over-WireGuard but I keep cluster access via Cloudflare and the firewall ACL. Not ideal. Fixable. Deferred.
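For the curious, the fallback kubeconfig is nothing exotic. A minimal sketch with placeholder values:

# ~/.kube/afrotomation-prod-public.yaml (sketch; IP and credentials are placeholders)
apiVersion: v1
kind: Config
clusters:
  - name: afrotomation-prod
    cluster:
      server: https://203.0.113.10:6443    # public IP instead of the tailnet IP
      certificate-authority-data: <base64-ca>
users:
  - name: admin
    user:
      client-certificate-data: <base64-cert>
      client-key-data: <base64-key>
contexts:
  - name: afrotomation-prod
    context:
      cluster: afrotomation-prod
      user: admin
current-context: afrotomation-prod

When the tailnet is down: kubectl --kubeconfig ~/.kube/afrotomation-prod-public.yaml get pods -A.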

Lesson: during a migration, the right answer to "should I also fix this other thing while I'm here" is almost always no.

6. Hashnode about to render this post with a stale "Cluster Not Found" cover

I almost didn't catch this one. The Microlink screenshot of clickrise.afrotomation.com that I'd uploaded as the cover for one of the migration retrospectives was taken during the k3s pivot, when the ingress was returning a "Cluster Not Found" error. Hashnode dutifully cached it. The cover image on a post titled "We Successfully Migrated to k3s" was a screenshot of k3s being broken.

Fix was a 30-second Cloudinary re-upload. But the lesson generalizes: any preview/screenshot pipeline that runs once and caches forever will, eventually, capture and immortalize your worst moment.

I now run a Cloudinary re-cache cron weekly for every portfolio cover. Costs nothing. Saves me from writing the same blog post about my own broken homepage twice.
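The cron itself can live in the cluster as a plain CronJob. A sketch under assumptions: it uses Cloudinary's unsigned upload endpoint, which accepts a remote URL as the file parameter, and the cloud name, preset, and source URL are all placeholders:

# weekly cover re-cache (sketch; <cloud-name>, <preset>, and the source URL are placeholders)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: portfolio-cover-recache
spec:
  schedule: "0 6 * * 1"          # Mondays at 06:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: recache
              image: curlimages/curl:8.8.0
              args:
                - -sf
                - https://api.cloudinary.com/v1_1/<cloud-name>/image/upload
                - -d
                - file=https://<screenshot-source>&upload_preset=<preset>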

What's actually working

Despite the above, the picture is good. Of the 47 apps I migrated:

  • All 47 are reachable. Different subdomains, different stacks, all behind the same ingress-nginx + cert-manager pair, all served from the same vps50 Contabo node.
  • The CloudNativePG cluster has held up. Three nodes, one primary, one synchronous replica on Oracle, one async replica on vps10. PITR shipping to B2 every minute. The 17 Neon dumps I took as cold backups have not been needed.
  • GitOps actually works now that image-updater is fixed. A merge to main on a feature repo lands in the cluster within ~5 minutes. Auditable in git log. Reversible by git revert. The Coolify-shaped hole in my workflow is filled.
  • Total infra spend is still under $50/month all-in for the cluster nodes and DNS. Less than half of what Vercel + Neon were costing for the same 47 apps.

What's next

This series — Afrotomation Startup — will be the running log of building Afrotomation past the bootstrap phase. Less "how I migrated" (already wrote those), more:

  • "I built the wrong thing this week and tore it down on Saturday"
  • "This integration broke in production and here's what I learned"
  • "I'm running out of free tier on X, here's what I'm replacing it with"
  • "Here's what shipped and how it's actually being used"

Subscribe to the publication if you want the next one in your inbox. Send me your own war stories on Twitter. The trenches are more fun with company.