
This blog runs on my homelab. Here is the architecture.

#kubernetes #nextjs #argocd #cloudflare #homelab #gitops

TL;DR

nacholar.com (Next.js, App Router) runs on my homelab k8s cluster. No Vercel, no public ingress, no cert-manager. Cloudflare Tunnels handle ingress, ArgoCD plus sha-pinned Helm values handle deploys, and a private Harbor registry on the LAN is bridged to GitHub Actions over Tailscale. The hardest part wasn't k8s. It was teaching containerd to pull from a self-signed LAN registry through node DNS.


What this is and who it's for

This is the blog you're reading.

The blog itself is unremarkable: a Next.js App Router app, no database, content in markdown. The interesting part is what it runs on: a homelab k8s cluster I own end to end, with a deploy pipeline I own end to end. No managed platform anywhere in the path from git push to a served request.

This post is for engineers who want to see what it actually takes to make a homelab cluster a serious deploy target: not "minikube on a beefy laptop," but a real setup with a private registry, GitOps, CI that lives outside the LAN, and ingress that doesn't require opening ports on a residential connection.

Vercel is genuinely good. I'm not arguing against it. I'm running the blog this way because I already pay for the cluster, I want to own the stack end to end, and "Cloud Architect who ships" carries more weight when the thing shipping the words is something I built and run.


The architecture

Topology

Internet
  └─ Cloudflare edge (TLS, HTTP/3, Brotli, cache)
       └─ outbound tunnel
             └─ cloudflared Deployment (2 replicas)
                   └─ Service: nacholar-blog (ClusterIP :80)
                         └─ Pod: node server.js (Next standalone :3000)

Everything from the tunnel inward is private. The cluster never listens on a public port.

Deploy loop

Push to main → GitHub Actions builds the container, joins the tailnet, pushes to Harbor → CI overwrites the image tag in the Helm values file and commits back to main with [skip ci] → ArgoCD's reconcile loop picks up the diff and rolls out a new pod → traffic continues flowing through the same cloudflared Deployment.

Git is the deploy log. Reverting a bot commit rolls back to the previous image.

Stack

  • Harbor: private registry on LAN
  • Tailscale: bridges GitHub Actions to LAN
  • ArgoCD: GitOps controller
  • Helm: chart for the app and the cloudflared Deployment
  • Cloudflare Tunnels: ingress
  • Next.js standalone output: packaged in a multi-stage container

The image

Next.js standalone bundles only the server files needed to run the app: a single server.js plus a minimal node_modules, no full install in production. The Dockerfile is a three-stage build:

FROM node:24-alpine AS deps
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --no-audit --no-fund

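FROM node:24-alpine AS builder
# Build inside /app so the runner stage can copy from /app/.next
WORKDIR /app
ENV NEXT_TELEMETRY_DISABLED=1
COPY --from=deps /app/node_modules ./node_modules
COPY . .
RUN npm run build

FROM node:24-alpine AS runner
# server.js, .next/static and public all land under /app
WORKDIR /app
ENV NODE_ENV=production NEXT_TELEMETRY_DISABLED=1 PORT=3000 HOSTNAME=0.0.0.0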
RUN addgroup -S -g 1001 nodejs && adduser -S -u 1001 -G nodejs nextjs
COPY --from=builder --chown=nextjs:nodejs /app/.next/standalone ./
COPY --from=builder --chown=nextjs:nodejs /app/.next/static ./.next/static
COPY --from=builder --chown=nextjs:nodejs /app/public ./public
RUN cd /app && npm install --no-audit --no-fund --omit=dev sharp \
    && chown -R nextjs:nodejs /app/node_modules
USER nextjs
EXPOSE 3000
CMD ["node", "server.js"]

Two non-obvious details:

  • sharp is installed in the runner stage, not just at build time, because the Next image optimizer needs it at runtime.
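  • The non-root user (uid 1001) is baked into the image rather than left entirely to the pod spec, so file ownership is correct from the start and the chart's runAsUser: 1001 only has to match it. It's easier than fighting runAsUser later.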

The chart

Deployment + Service + optional ServiceAccount. The security context is tight on purpose:

containerSecurityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop: ["ALL"]

podSecurityContext:
  runAsNonRoot: true
  runAsUser: 1001
  runAsGroup: 1001
  fsGroup: 1001

readOnlyRootFilesystem: true will crash the image optimizer the first time anyone requests an optimized image: it tries to write to .next/cache. The fix is two emptyDir mounts:

volumes:
  - name: tmp
    emptyDir: {}
  - name: next-cache
    emptyDir: {}

volumeMounts:
  - name: tmp
    mountPath: /tmp
  - name: next-cache
    mountPath: /app/.next/cache

Without those, the app starts cleanly and only fails on the first image request, which makes the cause easy to miss. With them, a read-only root filesystem costs you nothing.


Interesting engineering decisions

Four of them, each a real fork in the road.

1. No public ingress: Cloudflare Tunnels

The constraint: my ISP doesn't hand out static IPs, I don't want to open ports, and I don't want to run cert-manager + ACME against a residential connection. The default k8s answer (LoadBalancer Service + ingress controller + Let's Encrypt) is wrong for this environment. It assumes you control the network in front of the cluster, which I don't.

Cloudflare Tunnels invert ingress. cloudflared runs inside the cluster as a Deployment (two replicas) and opens an outbound connection to Cloudflare's edge. Cloudflare routes nacholar.com through that tunnel to a ClusterIP Service, which fronts the blog pod. There's no inbound connection from the public internet to my house. There's no public port open on the cluster. TLS, HTTP/3, Brotli, and edge cache are Cloudflare's problem.

containers:
  - name: cloudflared
    image: cloudflare/cloudflared:2026.3.0
    command: [cloudflared, tunnel, --no-autoupdate, --loglevel, info, --metrics, 0.0.0.0:2000, run]
    env:
      - name: TUNNEL_TOKEN
        valueFrom:
          secretKeyRef:
            name: tunnel-token
            key: token
    livenessProbe:
      httpGet:
        path: /ready
        port: 2000
      failureThreshold: 1
      initialDelaySeconds: 10
      periodSeconds: 10

Token-based config plus dashboard-driven public hostnames means no ConfigMap, no credentials.json on disk. The tunnel ingress lives in the Cloudflare dashboard under Zero Trust → Networks → Tunnels.

The tradeoff is honest: tight Cloudflare dependency. If their edge has an outage, the blog is down. For a personal blog this is fine. For a paying-customer product I'd want a fallback origin.

One caching gotcha worth knowing: Cloudflare doesn't cache HTML by default, and putting the origin behind a tunnel doesn't change that. You have to (a) set s-maxage on responses from the origin, and (b) create a Cache Rule in the dashboard that opts the hostname into using origin cache-control. Without the rule, s-maxage on HTML is ignored. Verify with curl -sI https://nacholar.com/some-post | grep -i cf-cache-status. The second request should return HIT.
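
The curl check above can be wrapped in a small helper. This is a minimal sketch, not part of my actual tooling: the function names cf_cache_status and assert_cached are made up for illustration, and the second function needs network access.

```shell
# Print the value of the cf-cache-status header from HTTP response headers on stdin.
# Header names are matched case-insensitively; the CR from HTTP line endings is stripped.
cf_cache_status() {
  tr -d '\r' | awk -F': *' 'tolower($1) == "cf-cache-status" { print $2; exit }'
}

# Succeed only if the second of two requests to URL is answered from the edge cache.
# (Hypothetical helper; requires network access.)
assert_cached() {
  url="$1"
  curl -sI "$url" > /dev/null   # first request warms the cache
  [ "$(curl -sI "$url" | cf_cache_status)" = "HIT" ]
}
```

Usage would look like assert_cached https://nacholar.com/some-post && echo cached. Anything other than HIT on the second request (MISS, DYNAMIC, BYPASS) suggests the Cache Rule or the s-maxage header is not taking effect.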

2. CI reaches a LAN registry: Tailscale + insecure registry config

Harbor lives at harbor.personal.k8s (192.168.100.71) on my LAN. GitHub Actions runners don't have LAN access. The naive answer is "expose Harbor publicly," but that means another public ingress and another real cert. The whole point of this stack is to keep public surface area tiny.

Tailscale does the bridge. The cluster advertises 192.168.100.0/24 via a subnet router. The CI job joins the tailnet using an OAuth client tagged tag:ci, accepts the routes, and Harbor becomes addressable from the runner like it's on the same network.
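
Before blaming Docker for a failed push, it helps to confirm the runner can actually reach Harbor over the tailnet. A rough sketch, assuming Harbor answers on HTTPS (pass http as the second argument if yours serves plain HTTP); registry_answered and probe_registry are illustrative names, not existing tools. A registry answers GET /v2/ with 200, or with 401 when it wants a login first, so either code proves the route works.

```shell
# Classify the HTTP status returned by GET /v2/ on a registry.
# 200 and 401 both prove a registry answered (401 just means "log in first").
registry_answered() {
  case "$1" in
    200|401) return 0 ;;
    *)       return 1 ;;
  esac
}

# Probe the registry from the CI runner (hypothetical helper; needs network access).
# -k skips certificate verification because the cert is self-signed;
# curl prints 000 as the status when it cannot connect at all.
probe_registry() {
  host="$1"
  scheme="${2:-https}"
  code="$(curl -sk -o /dev/null -m 10 -w '%{http_code}' "${scheme}://${host}/v2/" || true)"
  registry_answered "$code"
}
```

Running probe_registry harbor.personal.k8s right after the Tailscale step separates "routes not accepted" failures from the TLS problems described next.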

Two gotchas, because Harbor's default cert is self-signed:

  1. /etc/docker/daemon.json needs {"insecure-registries": ["harbor.personal.k8s"]} written before docker buildx starts.
  2. The buildx setup action needs buildkitd-config-inline with http = true and insecure = true for the registry.

Either step missing produces TLS errors that look like DNS failures. Diagnosing this the first time took embarrassingly long.
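
The two config files from the list above can be sketched as one function. This is an illustration under assumptions: the function name and the idea of passing output paths as arguments are mine, and the daemon.json line overwrites any existing file, so merge by hand if the runner already has Docker settings you care about.

```shell
# Write both insecure-registry configs for REGISTRY to the given paths.
# Paths are parameters so the function can be dry-run against a temp directory.
write_insecure_registry_config() {
  registry="$1"
  daemon_json="$2"
  buildkitd_toml="$3"

  # 1. Docker daemon: allow this registry without a trusted cert.
  #    NOTE: this replaces the whole file.
  printf '{"insecure-registries": ["%s"]}\n' "$registry" > "$daemon_json"

  # 2. BuildKit: the same exemption in buildkitd.toml syntax.
  cat > "$buildkitd_toml" <<EOF
[registry."$registry"]
  http = true
  insecure = true
EOF
}
```

In CI the first path would be /etc/docker/daemon.json (written with sudo, followed by a Docker restart), and the TOML body is what goes into the buildx setup action's buildkitd-config-inline input.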

A separate gotcha that's pure GitHub Actions, not infra: Harbor credentials live in a Production environment, not at repo level. If the job doesn't declare environment: Production, the secrets resolve as empty strings and the login fails with "Username and password required", with no hint that the missing environment declaration is the cause.
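
A cheap guard makes this failure explain itself. The sketch below assumes the secrets are exposed to the step as HARBOR_USERNAME and HARBOR_PASSWORD; those variable names and the function name are hypothetical, so substitute whatever your workflow uses.

```shell
# Fail loudly if any of the named environment variables is unset or empty.
require_nonempty() {
  missing=""
  for name in "$@"; do
    # Indirect lookup of the variable called $name (names must be trusted literals).
    eval "value=\${$name:-}"
    [ -n "$value" ] || missing="$missing $name"
  done
  if [ -n "$missing" ]; then
    echo "Missing or empty:$missing (does the job declare 'environment: Production'?)" >&2
    return 1
  fi
}
```

Calling require_nonempty HARBOR_USERNAME HARBOR_PASSWORD at the top of the login step turns the misleading "Username and password required" into a message that points at the actual cause.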

3. Image deploys without :latest: sha-pinned Helm values

ArgoCD reconciles by diffing rendered manifests against cluster state. If your image tag is :latest or :main, the rendered manifest doesn't change between deploys. ArgoCD sees no diff and does nothing, even though the registry has a new image under the same tag.

The fix is to write the git sha into the Helm values file on every CI run:

# deploy/chart/values.yaml
image:
  repository: harbor.personal.k8s/library/nacholar-blog
  tag: main   # CI overwrites this with the short sha

The last step in CI:

SHORT="${GITHUB_SHA::7}"
yq -i ".image.tag = \"${SHORT}\"" deploy/chart/values.yaml
git add deploy/chart/values.yaml
git commit -m "chore(deploy): bump image to ${SHORT} [skip ci]"
git push

[skip ci] keeps the bot's commit from re-triggering the workflow. ArgoCD's reconcile loop picks up the change within minutes, re-renders the chart, and rolls out a new pod. The git history is the deploy history. To roll back, revert the bot commit.
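
The rollback can be sketched as a pair of functions. This is illustrative rather than something that ships in the repo: the function names are made up, and it assumes the bot's commit subject matches the message used in the CI step above.

```shell
# Read `git log --format='%H %s'` lines on stdin (newest first) and print the hash
# of the newest commit whose subject starts with the bot's bump message.
latest_bump_commit() {
  awk '$2 == "chore(deploy):" && $3 == "bump" && $4 == "image" { print $1; exit }'
}

# Revert the newest bump commit and push (hypothetical helper; run inside the repo).
rollback_deploy() {
  sha="$(git log --format='%H %s' | latest_bump_commit)"
  [ -n "$sha" ] || { echo "no bump commit found" >&2; return 1; }
  git revert --no-edit "$sha" && git push
}
```

One convenient side effect: git's default revert message quotes the original subject, so the [skip ci] marker carries over and the revert does not kick off a fresh build. ArgoCD then sees the previous tag in values.yaml and rolls back to it.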

The alternative I considered: ArgoCD Image Updater, which can watch a registry and bump tags on its side. I didn't use it for two reasons: it adds another controller to keep alive, and the write-back pattern keeps the deploy record visible in git log. For a fleet of services, Image Updater starts to win. For one blog, the bash script wins.

4. Nodes pull from a private LAN registry: the containerd config_path saga

This was the most painful part of the exercise and the most useful thing I learned. Three problems, each manifesting as the same opaque "failed to pull image" symptom, each requiring a different fix, and the fixes have to be applied in order.

Problem 1: kubelet doesn't use CoreDNS.

Pods resolve DNS through CoreDNS. The kubelet and containerd, which do the actual image pull, don't. They run on the host and use the node's /etc/resolv.conf. My nodes were provisioned with 8.8.8.8 as their nameserver, which has no idea what harbor.personal.k8s is.

Symptom: dial tcp: lookup harbor.personal.k8s on 8.8.8.8:53: no such host.

Fix: add the entry to /etc/hosts on every node. Adding a hosts plugin to CoreDNS does not help: that's pod DNS, not node DNS.
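
Since this gets repeated on every node, an idempotent version is worth having. A minimal sketch, with a hypothetical function name; the file argument exists so it can be tried against a scratch copy before touching the real /etc/hosts (which needs root).

```shell
# Append "IP NAME" to a hosts file unless NAME is already mapped there.
ensure_hosts_entry() {
  ip="$1"
  name="$2"
  file="${3:-/etc/hosts}"
  # Look for NAME as a whole field on any non-comment line; exit 0 if found.
  if awk -v n="$name" '
      !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
      END { exit !found }' "$file"; then
    return 0
  fi
  printf '%s %s\n' "$ip" "$name" >> "$file"
}
```

After running ensure_hosts_entry 192.168.100.71 harbor.personal.k8s as root, getent hosts harbor.personal.k8s on the node should print the LAN address. Running it twice leaves a single entry, which a plain append would not.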

Problem 2: containerd doesn't trust Harbor's self-signed cert.

Symptom: x509: certificate is valid for core.harbor.domain, not harbor.personal.k8s.

Harbor's default cert is issued for core.harbor.domain. Even with DNS resolving, the cert won't validate.

Fix: a per-registry hosts.toml that tells containerd to skip TLS verification for this registry only:

# /etc/containerd/certs.d/harbor.personal.k8s/hosts.toml
server = "http://harbor.personal.k8s"

[host."http://192.168.100.71"]
  capabilities = ["pull", "resolve"]
  skip_verify = true

Problem 3: containerd won't read that file unless config_path is set.

Creating the file isn't enough. Containerd only reads per-registry hosts.toml files when config_path is configured in its main config. My nodes had no registry section at all.

Fix:

# append to /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "/etc/containerd/certs.d"
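
Appending blindly is only safe when the file has no registry table yet, as on my nodes; defining the same TOML table twice is an error. Here is a hedged sketch of a guarded append. The function name is hypothetical, and the table name is the one from this post, so check your own containerd config before relying on it.

```shell
# Append the registry config_path stanza unless the file already handles it.
ensure_config_path() {
  file="${1:-/etc/containerd/config.toml}"

  # A non-empty config_path is already set somewhere: nothing to do.
  if grep -Eq '^[[:space:]]*config_path[[:space:]]*=[[:space:]]*"[^"]+"' "$file"; then
    return 0
  fi

  # The registry table exists but has no usable config_path. Appending would
  # define the table twice, so stop and leave the edit to a human.
  if grep -Fq '[plugins."io.containerd.grpc.v1.cri".registry]' "$file"; then
    echo "registry table exists without config_path; edit $file by hand" >&2
    return 1
  fi

  cat >> "$file" <<'EOF'

[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "/etc/containerd/certs.d"
EOF
}
```

Run it as root and restart containerd afterwards. The second branch matters on nodes whose config was generated from defaults, where the registry table is typically present with an empty config_path.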

The full sequence per node:

# 1. DNS
echo '192.168.100.71 harbor.personal.k8s' | sudo tee -a /etc/hosts

# 2. Per-registry hosts.toml
sudo mkdir -p /etc/containerd/certs.d/harbor.personal.k8s
sudo tee /etc/containerd/certs.d/harbor.personal.k8s/hosts.toml << 'EOF'
server = "http://harbor.personal.k8s"

[host."http://192.168.100.71"]
  capabilities = ["pull", "resolve"]
  skip_verify = true
EOF

# 3. Enable config_path
sudo tee -a /etc/containerd/config.toml << 'EOF'

[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "/etc/containerd/certs.d"
EOF

# 4. Restart
sudo systemctl restart containerd

# 5. Verify (substitute a tag that exists in Harbor; CI pushes short-sha tags)
sudo crictl pull harbor.personal.k8s/library/nacholar-blog:latest

In production this is provisioning-time work: Ansible, cloud-init, Talos config, whatever you're using. In a homelab it's a one-time SSH session per node. I lost more time to these three problems than to every other piece of the stack combined.


What I'd do differently

Four things, ordered by how much they'd actually bite.

1. Bake node config into provisioning, not post-hoc SSH. Right now /etc/hosts and the containerd config on each node are edited by hand the first time a node joins. If I rebuild a node, I redo it. The right fix is an Ansible playbook or cloud-init userdata on the Proxmox templates. The cluster is small and stable enough that this hasn't bitten me, but it will bite the day I'm tired and replacing a failed node at 11 PM.

2. Give Harbor a real cert. skip_verify = true is fine for a homelab cluster behind Tailscale, but it's a smell. The right answer is cert-manager with an internal CA, or a Let's Encrypt cert via DNS-01. DNS-01 doesn't need Harbor to be publicly reachable, but it does need a hostname under a public domain I control, which harbor.personal.k8s isn't.

3. Extract the CI workflow. Every app I deploy is going to need the same Tailscale + Harbor + GitOps loop. Right now it's a single workflow file in this repo. The next step is a reusable workflow in a central repo and a thin caller per app.

4. Reconsider Image Updater if the fleet grows. Write-back to values.yaml from CI is fine for one app. At five or ten it gets tedious, and ArgoCD Image Updater starts to look like the right trade.


Status and what's next

It's running. Lighthouse scores match Vercel: the rendering work is identical, Cloudflare's edge handles the last mile the same way regardless of origin. Marginal cost is roughly zero: the cluster is already paid for, the Cloudflare tunnel is free tier, Tailscale is free at this scale.

Next up: extract the CI workflow into a reusable pattern, give Harbor a real cert, and move the next project onto the same cluster, which is the real test of whether the pattern scales past one blog.

"Cloud Architect who ships" is empty unless there's something shipping. This is one of those things. You're reading it through the architecture above.