Troubleshooting Kubernetes DNS Issues in a WSL2 Node: A Tale of Flannel and Routes

Running a Kubernetes cluster across multiple environments is always an adventure, and when one of those environments is WSL2, things can get particularly spicy. Recently, I hit a perplexing issue in my k3s cluster where one node—a WSL2 instance—was stubbornly unable to resolve DNS via kube-dns, while the rest of the cluster hummed along just fine. This post dives into the rabbit hole of diagnosing this problem, exploring the twists and turns of Flannel’s networking, WSL2’s quirks, and Kubernetes’ service routing. While I haven’t cracked the case yet, I’ll share the journey so far—complete with logs, pings, and a healthy dose of head-scratching.

The Setup: A Mixed-OS Kubernetes Cluster

My k3s cluster spans a handful of nodes with diverse operating systems, reflecting the chaotic beauty of a homelab:

dragonfly (WSL2): A worker node running Ubuntu 20.04 on WSL2, IP 192.168.45.101, part of my Windows 11 machine. This is the troublemaker.

phoenix: A control-plane node on Fedora 40, IP 10.100.1.50, hosting critical services like CoreDNS.

griffin and kraken: Worker nodes on Arch Linux and k3OS, IPs 10.100.1.75 and 10.100.1.88, respectively.

chimera: Another control-plane node on k3OS, IP 10.100.1.62.

The cluster uses k3s, a lightweight Kubernetes distribution, with Flannel as the CNI for pod networking. The pod CIDR is 10.244.0.0/16, and the service CIDR is 10.245.0.0/16. On dragonfly, Flannel configures the pod subnet as 10.244.7.0/24, with the flannel.1 interface sporting an IP of 10.244.7.0 and an MTU of 1350 to account for VXLAN overhead.
The cluster has been running smoothly for months, with kube-dns (ClusterIP 10.245.0.10) reliably resolving service names. That is, until dragonfly decided to throw a wrench in the works.
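For reference, each node's slice of the pod CIDR comes straight from its Node object, so a quick way to double-check who owns which /24 (and to confirm dragonfly really has 10.244.7.0/24) is:

# Show the pod subnet the control plane assigned to each node
kubectl get nodes -o custom-columns=NODE:.metadata.name,POD_CIDR:.spec.podCIDR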

The Problem: DNS Silence on dragonfly

One day, I noticed that workloads on dragonfly couldn’t resolve DNS. A quick test confirmed the issue:

root@dragonfly:~# nslookup kubernetes.default 10.245.0.10
;; connection timed out; no servers could be reached

Pinging the kube-dns service IP didn’t fare any better:

root@dragonfly:~# ping 10.245.0.10
PING 10.245.0.10 (10.245.0.10) 56(84) bytes of data.
^C
--- 10.245.0.10 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2032ms

Curiously, the same commands from phoenix or griffin worked perfectly, resolving DNS and pinging 10.245.0.10 without issue. This pointed to a dragonfly-specific problem. Since kube-dns is backed by CoreDNS, I checked its status:

kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
NAME                       READY   STATUS    RESTARTS   AGE   IP            NODE      NOMINATED NODE   READINESS GATES
coredns-5d4b7f8c6f-xyz12   1/1     Running   0          12h   10.244.2.20   phoenix   <none>           <none>

CoreDNS was happily running on phoenix with pod IP 10.244.2.20. The kube-dns service looked solid too:

kubectl get svc -n kube-system kube-dns -o wide
NAME       TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)                  AGE    SELECTOR
kube-dns   ClusterIP   10.245.0.10   <none>        53/UDP,53/TCP,9153/TCP   180d   k8s-app=kube-dns

So why couldn’t dragonfly reach 10.245.0.10 or, presumably, 10.244.2.20?
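One way to split the problem in half, and the idea behind the pings in the next section, is to take the ClusterIP (and therefore kube-proxy) out of the equation and query the CoreDNS pod directly. Assuming the pod keeps the IP 10.244.2.20 shown above, that looks like:

# Query CoreDNS by pod IP, bypassing the 10.245.0.10 service VIP entirely
nslookup kubernetes.default.svc.cluster.local 10.244.2.20

If this also times out, kube-proxy is off the hook and the pod network itself is broken; if it works, the finger points back at the service layer.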

Digging In: Networking Nightmares

Since dragonfly is a WSL2 node, I suspected its virtualized networking might be the culprit. WSL2 uses a NAT-based network stack, and with no iptables support, k3s relies on IPVS or userspace mode for kube-proxy. My first thought was that kube-proxy on dragonfly wasn’t mapping 10.245.0.10 to 10.244.2.20, but I needed to confirm pod-to-pod connectivity first.
I tried pinging the CoreDNS pod directly:

root@dragonfly:~# ping 10.244.2.20
PING 10.244.2.20 (10.244.2.20) 56(84) bytes of data.
From 10.244.7.0 icmp_seq=1 Destination Host Unreachable
From 10.244.7.0 icmp_seq=2 Destination Host Unreachable
^C

The error came from 10.244.7.0, the flannel.1 interface, suggesting Flannel’s VXLAN wasn’t routing traffic to phoenix’s pod CIDR (10.244.2.0/24). But dragonfly could ping phoenix’s node IP:

root@dragonfly:~# ping 10.100.1.50
PING 10.100.1.50 (10.100.1.50) 56(84) bytes of data.
64 bytes from 10.100.1.50: icmp_seq=1 ttl=63 time=1.45 ms
64 bytes from 10.100.1.50: icmp_seq=2 ttl=63 time=0.921 ms
^C

Node-to-node communication was fine, so the issue was specific to Flannel’s pod networking.

Flannel’s Routing Riddle

Flannel manages pod-to-pod communication via VXLAN, tunneling traffic between nodes. Each node gets a pod subnet (e.g., 10.244.7.0/24 for dragonfly, 10.244.2.0/24 for phoenix). I checked dragonfly’s Flannel config:

root@dragonfly:~# cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.7.1/24
FLANNEL_MTU=1350
FLANNEL_IPMASQ=true

This looked correct, matching the cluster’s pod CIDR and dragonfly’s subnet. Next, I examined the routing table:

root@dragonfly:~# ip route | grep 10.244.2
10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink

Wait, what? The route for phoenix’s pod CIDR (10.244.2.0/24) points at 10.244.2.0, the subnet’s own network address, not anything resembling phoenix’s node IP (10.100.1.50). As it turns out, that is the shape Flannel’s VXLAN backend always uses: 10.244.2.0 is phoenix’s flannel.1 address, and flanneld is supposed to pair the route with static ARP and forwarding-database entries that map that address to phoenix’s real node IP. So the route by itself proved nothing, but the "Destination Host Unreachable" replies coming from 10.244.7.0 strongly suggested that the companion peer state was missing: the kernel knew to push the packet into flannel.1 and then had no VXLAN peer to hand it to. More on where that state lives in a moment.

The service CIDR route seemed fine, though:

root@dragonfly:~# ip route | grep 10.245
10.245.0.0/16 dev flannel.1 scope link

This suggested traffic for 10.245.0.10 at least lands on flannel.1, but since kube-proxy rewrites it to the CoreDNS pod IP 10.244.2.20, it dies on the same broken path to phoenix’s pod subnet.
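It’s also worth confirming what kube-proxy actually programmed for the DNS VIP on this node rather than assuming. If k3s’s embedded kube-proxy really is in IPVS mode here (my working assumption, given the missing iptables support) and exposes its usual metrics address, checks along these lines would show the mode in use and whether 10.245.0.10:53 forwards to 10.244.2.20 (ipvsadm may need installing first):

# Ask kube-proxy which proxy mode it actually settled on (default metrics address)
curl -s http://127.0.0.1:10249/proxyMode
# The IPVS virtual server for the kube-dns UDP VIP and its real-server backends
ipvsadm -Ln -u 10.245.0.10:53
# In IPVS mode, ClusterIPs are also bound to the kube-ipvs0 dummy interface
ip addr show kube-ipvs0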

Peering into Flannel’s Soul

To understand why Flannel had left the tunnel half-built on this node, I checked its interface:

root@dragonfly:~# ip -d link show flannel.1
12: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1350 qdisc noqueue state UNKNOWN mode DEFAULT group default 
    link/ether 7a:b2:19:cd:45:ef brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535 
    vxlan id 1 local 192.168.45.101 dev eth0 srcport 0 0 dstport 8472 nolearning ttl auto ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 62780 gso_max_segs 65535

The local 192.168.45.101 matched dragonfly’s WSL2 IP, and dstport 8472 is Flannel’s default VXLAN port. The catch is that ip link only describes the local end of the tunnel: the remote peers (e.g., phoenix at 10.100.1.50) are programmed separately by flanneld as forwarding-database and neighbor entries on flannel.1, and on dragonfly those peer entries were nowhere to be found, hinting that the VXLAN tunnel was never fully established. That, in turn, could mean Flannel wasn’t syncing properly with the k3s API, which stores the subnet-to-node mappings.
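For concreteness, this is where that peer state lives and what each node publishes for flannel to consume (a sketch; a healthy node should show one entry per remote node in the first two commands, and the flannel.alpha.coreos.com annotation is the stock one set by k3s’s embedded flannel):

# Remote VTEPs: peer flannel.1 MAC -> peer node IP (should include 10.100.1.50)
bridge fdb show dev flannel.1
# Remote flannel.1 addresses (e.g. 10.244.2.0) -> their MACs
ip neigh show dev flannel.1
# What each node advertised to the k3s API for flannel to consume
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.flannel\.alpha\.coreos\.com/public-ip}{"\t"}{.spec.podCIDR}{"\n"}{end}'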

WSL2’s Shadowy Influence

WSL2’s NAT-based networking is notorious for complicating Kubernetes setups. As mentioned above, the missing iptables support pushes k3s toward IPVS or userspace mode for kube-proxy, and the NATed virtual network behind 192.168.45.101 can interfere with VXLAN. I wondered whether Windows Firewall was blocking UDP port 8472 (Flannel’s VXLAN port) or whether NAT was mangling the tunnel traffic. The successful ping to 10.100.1.50 suggested the firewall wasn’t entirely locked down, but VXLAN could still be getting dropped.
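A concrete way to test that suspicion, and what I plan to run next, is to watch for VXLAN traffic on both ends while repeating the failed ping. Because of WSL2’s NAT, packets that do make it out may show up at phoenix with the Windows host’s LAN IP as the source, which is worth remembering when reading the capture:

# On dragonfly, while running "ping 10.244.2.20" in another shell:
tcpdump -ni eth0 udp port 8472
# On phoenix, to see whether the encapsulated packets ever arrive:
tcpdump -ni any udp port 8472
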
Another suspect was MTU. Flannel set flannel.1 to 1350, which leaves room for the roughly 50 bytes of VXLAN overhead as long as eth0 (normally 1500 on WSL2) can genuinely carry 1400-byte encapsulated packets end to end; if WSL2’s NAT path can’t, larger packets would be silently dropped. Then there was WSL2’s kernel, which might lack full IPVS support, though the pod-to-pod failure pointed more to Flannel than to kube-proxy.
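Both of those suspects are cheap to test. A do-not-fragment ping at the encapsulated size (1350 bytes inner + 50 bytes of VXLAN/UDP/IP overhead = 1400 on the wire, so a 1372-byte ICMP payload) checks the MTU theory, and the kernel config, which WSL2 kernels usually expose at /proc/config.gz, settles the IPVS question:

# Can the path to phoenix really carry a full VXLAN-encapsulated frame without fragmenting?
ping -M do -s 1372 -c 3 10.100.1.50
# Does the WSL2 kernel ship IPVS at all?
zgrep -E 'CONFIG_IP_VS=|CONFIG_IP_VS_RR=' /proc/config.gz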

Where I’m At

So far, I’ve narrowed dragonfly’s issue down to Flannel’s VXLAN path to phoenix’s pod CIDR never being wired up: the route is there, but the peer state that should map phoenix’s flannel.1 address to its node IP isn’t, most likely because of a miscommunication in Flannel’s subnet syncing. WSL2’s NAT or firewall might be preventing the VXLAN tunnel from forming properly, leaving dragonfly cut off from 10.244.2.20 and, by extension, 10.245.0.10. The cluster’s other nodes are unaffected, happily resolving DNS, which makes dragonfly the odd one out.
I’m still unraveling this knot—torn between digging deeper into Flannel’s logs, tweaking WSL2’s network stack, or even rethinking how dragonfly integrates with the cluster. If you’ve hit similar issues with WSL2 and Kubernetes, I’d love to hear your war stories in the comments. For now, the quest continues, and I’ll keep you posted on what finally cracks this DNS dilemma.
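First stop will probably be those Flannel logs. Since k3s runs flannel inside the agent process rather than as a pod, its messages land in the agent’s own log; assuming the stock k3s-agent service name from the install script (or wherever the agent’s output is redirected on a systemd-less WSL2 setup), that’s roughly:

# Flannel's subnet/VXLAN messages are mixed into the k3s agent log
journalctl -u k3s-agent --no-pager | grep -iE 'flannel|vxlan|subnet'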