Solving One of the Most Frustrating NFS Issues in Longhorn


One of the most painful sources of confusion in Longhorn is its communication with the NFS server.
Longhorn needs an NFS server to provide volumes with the RWX (ReadWriteMany) access mode. This problem is rare, but DevOps operators working with Longhorn may still run into it, so let's go straight to the root of the issue.
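For context, an RWX volume in Longhorn is just a PersistentVolumeClaim that requests the ReadWriteMany access mode; a minimal sketch (the PVC name is made up, and "longhorn" is assumed to be the name of your Longhorn storage class):
$ kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data              # hypothetical name
spec:
  accessModes:
    - ReadWriteMany              # RWX: this is what makes Longhorn create an NFS share for the volume
  storageClassName: longhorn     # assumed Longhorn storage class name
  resources:
    requests:
      storage: 1Gi
EOF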
Issue Description:
NFS itself has plenty of problems with certain Linux kernel versions; the official Longhorn docs list the kernel versions on which running Longhorn is not recommended. But apart from kernel issues, NFS can fail in ways that disconnect Longhorn from the NFS server, and since most problems of this kind are hard to debug, the DevOps operator can easily get lost trying to find the root cause.
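If you want to rule out the kernel first, you can compare each node's kernel version against that list; a quick check:
$ kubectl get nodes -o wide    # the KERNEL-VERSION column shows each node's kernel
$ uname -r                     # or check directly on a node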
Recently I needed to replace my entire Kubernetes cluster, so I took two types of backups:
1. A full Velero cluster backup to store the cluster's resources, and
2. Longhorn volume backups to keep the persistent data safe.
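A rough sketch of those two backups, assuming the Velero CLI is installed and a Longhorn backup target is already configured (the backup name is made up):
$ velero backup create pre-migration --wait    # full backup of the cluster's resources
# The Longhorn volume backups were taken separately (e.g. from the Longhorn UI),
# since they go to Longhorn's own backup target rather than through Velero.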
When I restored the Longhorn volumes in the new cluster, and then the Velero resources, I realized there was a problem with the volumes whose access mode is RWX: the pods couldn't read from or write to their volumes, and a simple "ls" command would make the pod hang and freeze.
I checked the entire Longhorn system but couldn't find any problem with the Longhorn share-manager pods or any other related resources; everything looked completely fine. All the share-manager pods were running and, judging by their logs, connected to the NFS server.
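For reference, this is roughly how the share-manager side can be checked, assuming Longhorn is installed in the default longhorn-system namespace (the volume name in the pod name is a placeholder):
$ kubectl -n longhorn-system get pods | grep share-manager
$ kubectl -n longhorn-system logs share-manager-pvc-<volume-name>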
After a lot of searching and checking the pods, I realized that not all of them had this problem. About 3 of the 58 pods were fine and could access their RWX volumes, so the issue was not pervasive.
If I hadn’t noticed that, I might have assumed the problem was with the kernel and downgraded my cluster’s kernel to an earlier version.
Solution:
So I guessed there was some kind of limit on the NFS server. Meanwhile, the output of dmesg | grep nfs showed NFS timeouts against several server IPs:
[334579.017116] nfs: server 10.233.14.62 not responding, timed out
[334579.528673] nfs: server 10.233.58.28 not responding, timed out
[334579.528825] nfs: server 10.233.48.185 not responding, timed out
[334579.528894] nfs: server 10.233.22.75 not responding, timed out
[334583.112813] nfs: server 10.233.46.135 not responding, timed out
...
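Those 10.233.x.x addresses are in-cluster IPs, so they can be matched against the Longhorn share-manager services and pods to confirm which volumes are affected; a quick check, assuming the default longhorn-system namespace (the IP below is just one of the addresses from the log above):
$ kubectl -n longhorn-system get svc,pods -o wide | grep 10.233.14.62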
All the systemd services showed that the NFS server was running correctly. After a lot of research and digging through the Longhorn-related documentation, I finally noticed that the NFS server has a maximum thread count limit that causes this problem and prevents the other pods from connecting to the NFS server.
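The limit currently in effect can be checked on the node running the NFS server; a quick look, assuming the nfsd procfs interface is mounted:
$ sudo cat /proc/fs/nfsd/threads    # prints the nfsd thread count currently in effect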
The default value of the maximum thread count is 8, so I immediately increased it to 128 like this:
$ sudo systemctl edit --full nfs-server
# Find the ExecStart line and add 128 to the end of it:
...
[Service]
ExecStart=/usr/sbin/rpc.nfsd 128
...
# Save and exit.
$ sudo systemctl daemon-reload
$ sudo systemctl restart nfs-server
Editing the unit file makes the change persistent: the NFS server's thread limit stays at 128 across service restarts and reboots.
If you only need to raise the limit on the running server (this takes effect immediately but is reverted the next time nfsd restarts), you can write the value directly to procfs instead:
$ echo 128 | sudo tee /proc/fs/nfsd/threads
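Either way, re-reading the same procfs file confirms that the new limit is in effect:
$ sudo cat /proc/fs/nfsd/threads    # should now print 128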
Finally, restart all the problematic pods and you will see that the issue is gone.
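A sketch of that last step, assuming the affected workloads are Deployments in a namespace called my-app (both names are hypothetical):
$ kubectl -n my-app rollout restart deployment    # restarts every Deployment in the namespace
# or delete the stuck pods one by one so their controllers recreate them:
$ kubectl -n my-app delete pod <pod-name>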