Troubleshooting k3s Agent Container Errors: A Real-World Fix
Recently, I ran into a pesky issue on one of my k3s nodes, a server I’ll call NEXUS, running in a homelab cluster. The k3s agent logs were filling up with errors about missing containers, and I couldn’t ignore them any longer. If you’ve seen messages like Failed to create existing container: ... task ... not found in your k3s logs, this post is for you. I’ll walk you through how I diagnosed and resolved the issue, with practical steps to get your cluster back on track.
The Problem: Missing Container Errors
The logs on NEXUS looked something like this (I’ve anonymized the details):
Apr 13 23:51:57 NEXUS k3s[1234]: E0413 23:51:57.033807 1234 manager.go:1116] Failed to create existing container: /kubepods.slice/.../cri-containerd-abcd1234.scope: task abcd1234 not found
Apr 13 23:52:00 NEXUS k3s[1234]: E0413 23:52:00.140450 1234 manager.go:1116] Failed to create existing container: /kubepods.slice/.../cri-containerd-efgh5678.scope: task efgh5678 not found
These errors suggested that k3s was trying to manage containers that no longer existed in the container runtime (containerd, in my case). It was as if k3s and containerd were out of sync, with k3s holding onto references to “ghost” containers.
Why This Happens
This issue can pop up for a few reasons:
Stale State: k3s thinks a container exists, but containerd has already cleaned it up.
Runtime Issues: containerd might have crashed or failed to track tasks properly.
Node Disruptions: A reboot, disk pressure, or network hiccup could desynchronize the cluster.
In my case, NEXUS had been running smoothly until a recent power cycle during some network maintenance on my 192.168.10.0/24 subnet. I suspected the abrupt restart left k3s and containerd in a mismatched state.
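Before changing anything, it’s worth confirming you’re seeing the same symptom and how often. A quick way to do that (assuming systemd and journalctl, and that your agent runs under the k3s unit like mine) is to count the errors:
# Count stale-container errors from the last hour; use -u k3s-agent if that's your unit name
journalctl -u k3s --since "1 hour ago" | grep -c "Failed to create existing container"
If the count keeps climbing between runs, you’re likely hitting the same desync I describe below.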
Step-by-Step Diagnosis and Fix
Here’s how I tackled the problem, with commands you can adapt to your setup.
1. Check the Environment
I started by confirming the k3s version:
k3s --version
Output: k3s version v1.27.8+k3s1. Not the latest, but recent enough. NEXUS was an agent node in a cluster with two servers (let’s call them ALPHA and BETA, reachable at 10.0.0.10:6443). I also verified containerd was running:
systemctl status containerd
It was active, but I checked its logs for clues:
journalctl -u containerd
Nothing obvious, so I moved on.
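In hindsight, piping the full containerd journal is a lot to read. Filtering to error-priority messages from the current boot narrows it down; this assumes containerd logs to systemd-journald, as it did on NEXUS:
# Only error-priority containerd messages since the last boot
journalctl -u containerd -b -p err --no-pager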
2. Inspect k3s Health
The k3s service was up:
systemctl status k3s
But the errors kept piling up. I checked the cluster’s view of NEXUS:
kubectl get nodes
NEXUS was Ready, so the issue wasn’t critical enough to break workloads, but it was annoying.
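Beyond kubectl get nodes, the node’s conditions and recent events can reveal disk pressure or kubelet restarts. NEXUS here is just my node name, so substitute your own:
# Inspect node conditions (DiskPressure, MemoryPressure, etc.) and recent events
kubectl describe node NEXUS
# Or just the conditions, if you have jq installed
kubectl get node NEXUS -o json | jq '.status.conditions'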
3. Peek at Containers
I used containerd’s CLI (ctr) to see what containers were actually running:
ctr -n k8s.io tasks ls
ctr -n k8s.io containers ls
None of the task IDs from the logs (like abcd1234) showed up. This confirmed my suspicion: k3s was trying to manage containers that containerd had already forgotten.
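If you have more than a couple of errors, eyeballing the output gets tedious. A rough sketch that pulls the task IDs out of the k3s log and checks each one against containerd (assuming the error format shown above) looks like this:
# Extract task IDs from the k3s errors and check each against containerd
journalctl -u k3s --since "1 hour ago" \
  | grep -oE 'task [[:alnum:]]+ not found' \
  | awk '{print $2}' | sort -u \
  | while read -r id; do
      ctr -n k8s.io tasks ls | grep -q "$id" \
        && echo "$id: still tracked by containerd" \
        || echo "$id: gone from containerd"
    done
In my case, every ID came back as gone from containerd.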
4. Clean Up the Mess
To fix the desync, I decided to clear out the stale state. Here’s what I did:
Stop k3s:
systemctl stop k3s
Clear containerd’s k8s namespace:
I listed all containers in the k8s.io namespace:
ctr -n k8s.io containers ls
Since I was okay with potentially restarting workloads, I deleted everything (use caution here; if a delete fails because a task is still running, see the sketch after these steps):
ctr -n k8s.io containers ls -q | xargs ctr -n k8s.io containers delete
Restart services:
systemctl restart containerd
systemctl start k3s
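One caveat with the bulk delete above: ctr containers delete will refuse to remove a container whose task is still running. If you hit that, killing and removing the lingering tasks first should clear the way. This is a rough sketch that assumes k3s is already stopped, and it will hard-kill whatever those tasks are running:
# Hard-kill any tasks still tracked in the k8s.io namespace, then remove the task records
ctr -n k8s.io tasks ls -q | xargs -r -n1 ctr -n k8s.io tasks kill -a -s SIGKILL
sleep 2  # give containerd a moment to mark the tasks as stopped
ctr -n k8s.io tasks ls -q | xargs -r -n1 ctr -n k8s.io tasks delete
# Now the container cleanup from above should succeed
ctr -n k8s.io containers ls -q | xargs -r -n1 ctr -n k8s.io containers delete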
5. Verify the Fix
After restarting, I checked the logs:
journalctl -u k3s
The errors were gone! I also confirmed pods were running:
kubectl get pods --all-namespaces
Everything looked healthy, and NEXUS was back to normal.
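If you’d rather not scroll the whole journal, a targeted check works too. The time window is just a guess (adjust it to when you restarted), and NEXUS is my node name:
# Should print nothing if the stale-container errors are really gone
journalctl -u k3s --since "10 minutes ago" | grep "Failed to create existing container"
# Confirm the pods scheduled on this node specifically
kubectl get pods --all-namespaces --field-selector spec.nodeName=NEXUS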
Preventing Future Issues
To avoid this headache again, I took a few steps:
Upgraded k3s:
curl -sfL https://get.k3s.io | sh -
The latest version might have fixes for edge cases like this.
Monitored Disk Space:
df -h
NEXUS had plenty of space, but I set up alerts to catch issues early (a minimal version of that check follows this list).
Checked Clock Sync:
timedatectl
My NTP server (10.0.0.5) was keeping things in sync, but it’s worth double-checking.
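For the disk-space alerting mentioned above, here’s the kind of minimal check I’d drop into cron. The 85% threshold and the /var/lib/rancher path are my own assumptions, so tune them for your node:
#!/bin/sh
# Alert (via syslog) if the filesystem holding k3s state is above 85% full
USAGE=$(df --output=pcent /var/lib/rancher | tail -1 | tr -dc '0-9')
if [ "$USAGE" -gt 85 ]; then
  echo "Disk usage on $(hostname) is ${USAGE}%" | logger -t disk-alert
fi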
When to Escalate
I find myself doing this kind of cleanup fairly often in my more experimental lab environments, and in most cases it does the trick. k3s keeps a lot of state around to aid recovery, but sometimes that state is extra baggage that gets in the way of a clean rebuild. Always back up first. If the steps above don’t work, try these:
Rejoin the Node:
Stop k3s, uninstall it (k3s-agent-uninstall.sh), and rejoin the cluster using the token from ALPHA or BETA; a sketch of that flow follows this list.
Dig Deeper:
Collect logs (journalctl -u k3s > k3s.log) and search the k3s GitHub issues for similar reports.
Backup First:
If you’re clearing state, back up /var/lib/rancher/k3s to avoid losing critical data.
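For reference, here’s roughly what that back-up-and-rejoin flow looks like on an agent node. The server URL matches my setup (https://10.0.0.10:6443), and the <node-token> placeholder is whatever you copy from /var/lib/rancher/k3s/server/node-token on ALPHA or BETA:
# Back up local k3s state first
tar -czf /root/k3s-backup-$(date +%F).tar.gz /var/lib/rancher/k3s
# Remove the agent install (the installer places this script in /usr/local/bin)
/usr/local/bin/k3s-agent-uninstall.sh
# Rejoin, pointing at one of the servers; <node-token> comes from a server's
# /var/lib/rancher/k3s/server/node-token
curl -sfL https://get.k3s.io | K3S_URL=https://10.0.0.10:6443 K3S_TOKEN=<node-token> sh -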
Final Thoughts
This issue was a reminder that lightweight Kubernetes distros like k3s, while awesome for small clusters, can still throw curveballs. By methodically checking containerd, k3s, and the cluster state, I got NEXUS back online without major disruption. If you hit similar errors, don’t panic; just follow the steps above and you’ll likely sort it out.
Have you run into this issue before? Drop a comment below!