Troubleshooting k3s Agent Container Errors: A Real-World Fix


Recently, I ran into a pesky issue on one of my k3s nodes, a server I’ll call NEXUS, running in a homelab cluster. The k3s agent logs were filling up with errors about missing containers, and I couldn’t ignore them any longer. If you’ve seen messages like Failed to create existing container: ... task ... not found in your k3s logs, this post is for you. I’ll walk you through how I diagnosed and resolved the issue, with practical steps to get your cluster back on track.

The Problem: Missing Container Errors

The logs on NEXUS looked something like this (I’ve anonymized the details):

Apr 13 23:51:57 NEXUS k3s[1234]: E0413 23:51:57.033807    1234 manager.go:1116] Failed to create existing container: /kubepods.slice/.../cri-containerd-abcd1234.scope: task abcd1234 not found
Apr 13 23:52:00 NEXUS k3s[1234]: E0413 23:52:00.140450    1234 manager.go:1116] Failed to create existing container: /kubepods.slice/.../cri-containerd-efgh5678.scope: task efgh5678 not found

These errors suggested that k3s was trying to manage containers that no longer existed in the container runtime (containerd, in my case). It was as if k3s and containerd were out of sync, with k3s holding onto references to “ghost” containers.
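
Before digging in, it can help to gauge how noisy the problem is. A quick, hedged check is to count recent occurrences of the error string (note that on agent nodes installed the standard way the systemd unit is usually k3s-agent rather than k3s, so adjust the unit name to match your setup):

# count how often the error has appeared in the last hour
journalctl -u k3s --since "1 hour ago" | grep -c "Failed to create existing container"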

Why This Happens

This issue can pop up for a few reasons:
Stale State: k3s thinks a container exists, but containerd has already cleaned it up.
Runtime Issues: containerd might have crashed or failed to track tasks properly.
Node Disruptions: A reboot, disk pressure, or network hiccup could desynchronize the cluster.

In my case, NEXUS had been running smoothly until a recent power cycle during some network maintenance on my 192.168.10.0/24 subnet. I suspected the abrupt restart left k3s and containerd in a mismatched state.

Step-by-Step Diagnosis and Fix

Here’s how I tackled the problem, with commands you can adapt to your setup.

1. Check the Environment

I started by confirming the k3s version:

k3s --version

Output: k3s version v1.27.8+k3s1. Not the latest, but recent enough. NEXUS was an agent node in a cluster with two servers (let’s call them ALPHA and BETA, reachable at 10.0.0.10:6443). I also verified containerd was running:

systemctl status containerd

It was active, but I checked its logs for clues:

journalctl -u containerd

Nothing obvious, so I moved on.
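
One caveat worth mentioning: on a stock k3s install, containerd is embedded in the k3s process and there is no separate containerd systemd unit to inspect. If that's your setup, the runtime log lives under the k3s data directory and ctr is available through the bundled subcommand. A minimal sketch, assuming the default paths:

# embedded containerd writes its own log under the k3s data dir
tail -n 50 /var/lib/rancher/k3s/agent/containerd/containerd.log
# the bundled ctr talks to the embedded runtime directly
k3s ctr version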

2. Inspect k3s Health

The k3s service was up:

systemctl status k3s

But the errors kept piling up. I checked the cluster’s view of NEXUS:

kubectl get nodes

NEXUS was Ready, so the issue wasn’t critical enough to break workloads, but it was annoying.
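
If you want more detail than the Ready column, describing the node shows its conditions (disk pressure, memory pressure, and so on) and any recent events. Here nexus is just a placeholder for your actual node name:

kubectl describe node nexus
# recent events referencing the node, across all namespaces
kubectl get events -A --field-selector involvedObject.name=nexus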

3. Peek at Containers

I used containerd’s CLI (ctr) to see what containers were actually running:

ctr -n k8s.io tasks ls
ctr -n k8s.io containers ls

None of the task IDs from the logs (like abcd1234) showed up. This confirmed my suspicion: k3s was trying to manage containers that containerd had already forgotten.
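
A quick way to cross-check a specific ID from the error messages against what containerd actually knows about (abcd1234 is just the ID from my logs; substitute one of yours):

ctr -n k8s.io tasks ls | grep abcd1234 || echo "task not known to containerd"
ctr -n k8s.io containers ls | grep abcd1234 || echo "container not known to containerd"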

4. Clean Up the Mess

To fix the desync, I decided to clear out the stale state. Here’s what I did:
Stop k3s:

systemctl stop k3s

Clear containerd’s k8s namespace:
I listed all containers in the k8s.io namespace:

ctr -n k8s.io containers ls

Since I was okay with potentially restarting workloads, I deleted everything in one go (use caution here; a gentler per-container variant is sketched at the end of this step):

ctr -n k8s.io containers ls -q | xargs ctr -n k8s.io containers delete

Restart services:

systemctl restart containerd
systemctl start k3s
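
For reference, here's a slightly gentler sketch of that cleanup: it walks the containers one by one and removes any leftover task before deleting the container, instead of wiping everything in a single pipe. Same assumption as above, namely that you're fine with the workloads being recreated:

for c in $(ctr -n k8s.io containers ls -q); do
  # remove the task if one still exists; ignore errors if it's already gone
  ctr -n k8s.io tasks delete "$c" 2>/dev/null || true
  ctr -n k8s.io containers delete "$c"
done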

5. Verify the Fix

After restarting, I checked the logs:

journalctl -u k3s

The errors were gone! I also confirmed pods were running:

kubectl get pods --all-namespaces

Everything looked healthy, and NEXUS was back to normal.
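
A quick sanity check at this point is to count the error string again after a few minutes of uptime; zero new hits is a good sign (same caveat as earlier about the unit possibly being named k3s-agent):

journalctl -u k3s --since "10 minutes ago" | grep -c "Failed to create existing container"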

Preventing Future Issues

To avoid this headache again, I took a few steps:
Upgraded k3s:

curl -sfL https://get.k3s.io | sh -

The latest version might have fixes for edge cases like this.
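
One note on that command: the install script decides between server and agent mode based on the environment variables you pass it, so on an agent node like NEXUS you would normally re-run it with the same server URL and join token you originally joined with. Something like this, with the token being a placeholder for your own:

curl -sfL https://get.k3s.io | K3S_URL=https://10.0.0.10:6443 K3S_TOKEN=<your-join-token> sh -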

Monitored Disk Space:

df -h

NEXUS had plenty of space, but I set up alerts to catch issues early.
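
Beyond a general df, the directories that actually tend to fill up on a k3s node are the k3s data directory and the kubelet directory, so it's worth checking those specifically (default paths shown; adjust if you've relocated them):

df -h /var/lib/rancher /var/lib/kubelet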

Checked Clock Sync:

timedatectl

My NTP server (10.0.0.5) was keeping things in sync, but it’s worth double-checking.
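
If the node uses systemd-timesyncd, timedatectl can also report which server it is actually syncing against; chrony users would run chronyc sources instead:

timedatectl timesync-status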

When to Escalate

I find myself doing this kind of cleanup fairly often in my more experimental lab environments, and in most cases it does the trick. k3s holds onto a lot of state to aid recovery, but sometimes that state is exactly the baggage that prevents a clean rebuild. Always back up first. If the steps above don't work, try these:
Rejoin the Node:
Stop k3s, uninstall it (k3s-agent-uninstall.sh), and rejoin the cluster using the token from ALPHA or BETA (a rough sketch follows this list).

Dig Deeper:
Collect logs (journalctl -u k3s > k3s.log) and search the k3s GitHub issues for similar reports.

Backup First:
If you’re clearing state, back up /var/lib/rancher/k3s to avoid losing critical data.
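
Putting the backup and rejoin steps together, here's a rough sketch of that last-resort path, assuming the k3s defaults (join token read from /var/lib/rancher/k3s/server/node-token on a server node, with ALPHA serving the API at 10.0.0.10:6443):

# on NEXUS: back up state, then remove the agent install
tar czf /root/k3s-agent-state.tar.gz /var/lib/rancher/k3s
/usr/local/bin/k3s-agent-uninstall.sh

# on ALPHA: read the cluster join token
cat /var/lib/rancher/k3s/server/node-token

# back on NEXUS: rejoin the cluster as an agent
curl -sfL https://get.k3s.io | K3S_URL=https://10.0.0.10:6443 K3S_TOKEN=<token-from-alpha> sh -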

Final Thoughts

This issue was a reminder that lightweight Kubernetes distros like k3s, while awesome for small clusters, can still throw curveballs. By methodically checking containerd, k3s, and the cluster state, I got NEXUS back online without major disruption. If you hit similar errors, don’t panic—just follow the steps above, and you’ll likely sort it out.
Have you run into this issue before? Drop a comment!