FS#71775 - NFS regression experienced with 5.13.x kernels (server side)

Attached to Project: Arch Linux
Opened by Mike Javorski (javmorin) - Sunday, 08 August 2021, 22:59 GMT
Last edited by Jan Alexander Steffens (heftig) - Monday, 20 September 2021, 17:58 GMT
Task Type Bug Report
Category Kernel
Status Closed
Assigned To Jan Alexander Steffens (heftig)
Architecture x86_64
Severity Medium
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 3
Private No

Details

Description:

I have been experiencing NFS file access hangs with multiple release
versions of the 5.13.x linux kernel. In each case, all file transfers
freeze for 5-10 seconds and then resume. This seems worse when reading
through many files sequentially (jumping between and seeking within video files often provokes it).

My server:
- Archlinux w/ an arch kernel package
- filesystems exported with "rw,sync,no_subtree_check,insecure" options

Client:
- Archlinux w/ latest provided "arch" kernel (5.13.9-arch1-1 at writing)
- nfs mounted via /net autofs with "soft,nodev,nosuid" options
(ver=4.2 is indicated in mount)

I have tried the 5.13.x kernel several times since the first stable
release (most recently with 5.13.9-arch1-1), all with similar results.
Each time, I am forced to downgrade the linux package to a 5.12.x
kernel (5.12.15-arch1 as of writing) to clear up the transfer issues
and stabilize performance. No other changes are made between tests. I
have confirmed the freezing behavior using both ext4 and btrfs
filesystems exported from this server.

At this point I would appreciate some guidance in what to provide in
order to diagnose and resolve this issue. I don't have a lot of kernel
debugging experience, so instruction would be helpful.


Additional info:
* linux 5.13.x-arch vs 5.12.15-arch1-1


Closed by  Jan Alexander Steffens (heftig)
Monday, 20 September 2021, 17:58 GMT
Reason for closing:  Fixed
Additional comments about closing:  linux 5.14.6.arch1-1
Comment by loqs (loqs) - Sunday, 08 August 2021, 23:59 GMT
Have you tried 5.14-rc5? You could also consider bisecting between 5.12 and 5.13 to find the commit introducing the issue.
Could you also try reproducing with 5.13.9 without the three commits Arch added, as requested by upstream [1]?

[1] https://lore.kernel.org/linux-nfs/CAOv1SKCmdtchm5Z2NU80o49tkrHpAkPFaHKj4-vLDN5bZNCz-Q%40mail.gmail.com/
Comment by Mike Javorski (javmorin) - Monday, 09 August 2021, 00:35 GMT
@loqs I have not tried 5.14-rc5 yet (it's compiling now). I have checked the delta between 5.13.9 mainline and the arch version, and there are no fs/rpc related deltas (as I mentioned in that linux-nfs list message), so there should be no impact there. I am going to try to find a way to reliably reproduce the issue if I can, and get a task or log capture.
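One possible way to make the freezes measurable from the client side is to time individual reads at random offsets and record any that take longer than a threshold. The following is only a minimal sketch, not something that was used in this task: the function name `find_stalls`, its parameters, and the 2-second default threshold are all assumptions chosen for illustration.

```python
import os
import random
import time


def find_stalls(paths, reads_per_file=50, chunk_size=1 << 20,
                threshold_s=2.0, clock=time.monotonic, rng=None):
    """Seek to random offsets in each file and time every read.

    Returns a list of (path, offset, seconds) tuples, one for each
    seek+read that took at least threshold_s seconds.
    All names and defaults here are hypothetical, for illustration only.
    """
    rng = rng or random.Random()
    stalls = []
    for path in paths:
        size = os.path.getsize(path)
        if size == 0:
            continue  # nothing to read from an empty file
        # buffering=0 disables Python-level buffering; the kernel page
        # cache can still serve repeated reads without touching the server.
        with open(path, "rb", buffering=0) as f:
            for _ in range(reads_per_file):
                offset = rng.randrange(size)
                start = clock()
                f.seek(offset)
                f.read(chunk_size)
                elapsed = clock() - start
                if elapsed >= threshold_s:
                    stalls.append((path, offset, elapsed))
    return stalls
```

Calling `find_stalls` with a list of large files on the NFS mount and printing the returned tuples would give timestamps-free evidence of each stall. Files that are already cached on the client may never reach the server, so large or previously unread files are the better test input.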
Comment by loqs (loqs) - Monday, 09 August 2021, 03:18 GMT
https://drive.google.com/file/d/1VxZWPk1FqpTHbaNeYvfsMRjTnwNXdWWk/view?usp=sharing linux-loqs-5.12.r3616.gb5b3097d9cbb-1-x86_64.pkg.tar.zst commit before nfsd pull for 5.13
https://drive.google.com/file/d/19hRre_IeAHomEdZnoOQAySbTLOZIhY2z/view?usp=sharing linux-loqs-5.12rc4.r70.gb73ac6808b0f-1-x86_64.pkg.tar.zst last commit of nfsd pull for 5.13
If the first kernel is good and the second is bad, that would narrow it down to 70 commits. Both kernels are unpatched Linux mainline, with -loqs appended to the name so you can install them alongside the linux kernel package.
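As a rough guide to how much work bisecting such a range would be: a binary search over n candidate commits needs about ceil(log2(n)) build-and-test rounds. The helper below is a hypothetical illustration of that arithmetic only; the actual number of steps git bisect takes can differ slightly depending on the shape of the history.

```python
import math


def bisect_steps(candidate_commits):
    """Approximate number of test rounds for a binary search over commits."""
    if candidate_commits <= 1:
        return 0  # a single candidate needs no further testing
    return math.ceil(math.log2(candidate_commits))


print(bisect_steps(70))  # 7
```

So a 70-commit range would take roughly seven kernel builds, each followed by enough uptime to see whether the freezes appear.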
Comment by Mike Javorski (javmorin) - Monday, 09 August 2021, 17:13 GMT
@loqs I did try those two kernels last night, but I was unable to trigger the behavior and ran out of time. I really need to find a way to reliably trigger the behavior, but I will try again when I am able this week. It may not be for a couple of days due to current workload.

Thank you for your help.
Comment by Mike Javorski (javmorin) - Tuesday, 10 August 2021, 16:30 GMT
I was able to recreate the freezing with the 5.14_rc4 kernel. I am going to try again with the ones @loqs provided and see if I can get either of them to trigger as well. It seems I need to leave the system up for a little while before the freezes happen (5.14 was fine when tested last night, but showed issues this morning).
Comment by Mike Javorski (javmorin) - Sunday, 15 August 2021, 01:30 GMT
@loqs: I was not able to recreate the freezing with the two kernels you provided. From reading the kernel commit logs, there were some fixes to NFS that were added late in the 5.13 release process due to other regressions identified. It's possible this issue was introduced at that time, or later in the release process. I went back to testing with the latest 5.13 (5.13.10-arch1) and was able to reproduce the issue. I provided a tcpdump capture on the linux-nfs list, where Neil Brown (neilb at suse.de) suggested some diagnostic commands to try to capture the issue. This is in the thread linked above.

Here is a link to that cap file if you are interested: https://drive.google.com/file/d/1T42iX9xCdF9Oe4f7JXsnWqD8oJPrpMqV/view?usp=sharing

I am hoping that Neil may come back with some insights as well.
Comment by taz (taz) - Friday, 27 August 2021, 13:19 GMT
As a user having the same issue (I believe), it's the server mode in the 5.13 kernel that is broken. If I downgrade the server to a 5.12 kernel but keep the client on a 5.13 kernel, the issue is gone.
I've also noticed that it takes a while before the issue starts to be noticeable, mainly several GB of data transferred, and the more data are transferred, the more frequent/slow the issue becomes. Remounting on the client, or restarting nfsd on the server does not help IIRC.

HTH
Comment by Mike Javorski (javmorin) - Friday, 27 August 2021, 17:11 GMT
I have been working on testing some patches suggested by devs on the linux-nfs mailing list. Hope to have results later in the day, and will share if something comes of them.
Comment by Mike Javorski (javmorin) - Friday, 27 August 2021, 22:08 GMT
This patch: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=73367f05b25dbd064061aee780638564d15b01d1
from https://lore.kernel.org/linux-nfs/20210825193314.354079-1-trond.myklebust@hammerspace.com/ seems to have resolved the freezing issue for me.

I have asked if there is any possibility of it being back-ported to linux-stable/5.13, but I don't know if that's a possibility, or the timing of same. Maybe the archlinux devs can do that for the arch kernel in the meantime?

It's already merged into 5.14 (likely to be final this weekend) so that should resolve the trouble too when the archlinux kernel is updated.
Comment by Mike Javorski (javmorin) - Friday, 27 August 2021, 22:13 GMT
FYI: If anyone wants to try a kernel with this patch in the meantime, here is a link to download the kernel I built which is just the archlinux kernel package + the patch: https://drive.google.com/file/d/1Hgonw9R8eTdH2oAJKYafcYMMiVDu7_lc/view?usp=sharing

I will leave it up until this patch (or a similar solution) lands in the main linux package.
Comment by Mike Javorski (javmorin) - Saturday, 28 August 2021, 03:30 GMT
All: Another patch appears to be needed to get NFS function back to 5.12 levels. This is in addition to the patch mentioned above.

This further patch can be found here: https://lore.kernel.org/linux-nfs/162915504980.9892.4132343755469951234@noble.neil.brown.name/T/#md4e6e4300ed2a36260eca0d8befb7744732df3fe

If anyone should want to test it, but not want to deal with the recompile time, here is yet another kernel package I have built which includes both of these fixes: https://drive.google.com/file/d/19R7oECtlCLixGqMM_99kYtNGp-M7veUY/view?usp=sharing

I don't believe this second patch has been pushed for inclusion upstream yet, so it will likely miss the initial 5.14 release if that happens this weekend.
Comment by Mike Javorski (javmorin) - Friday, 10 September 2021, 19:53 GMT
For those still waiting on the fixes, the changes were merged into the 5.15 branch, and are likely going to get pulled in for 5.14.3 (they missed the 5.14.2 cutoff).

I will update once it's actually merged/released on stable (unless the Arch devs decide to backport it before then).
Comment by Mike Javorski (javmorin) - Friday, 17 September 2021, 01:10 GMT
@heftig can you please add https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.15-rc1&id=e38b3f20059426a0adbde014ff71071739ab5226 to the 5.14 arch kernel? It's in the 5.15-rc, but it has missed the last 3 stable releases. It's the second part of the NFS fixes this task covers (the first patch made it into 5.14).




Comment by Mike Javorski (javmorin) - Monday, 20 September 2021, 15:59 GMT
All, the 5.14.6.arch1-1 kernel includes the final patch (thanks @heftig) and things look good on my test system.

I will continue monitoring upstream to make sure it lands in 5.14 proper, but at this point Arch users should be all set.
