FS#59419 - [linux] general protection fault: 0000 (random kernel panic, oops; nfs, nfsd)

Attached to Project: Arch Linux
Opened by John Doe (user832) - Monday, 23 July 2018, 00:06 GMT
Last edited by freswa (frederik) - Sunday, 13 September 2020, 14:05 GMT
Task Type Bug Report
Category Kernel
Status Closed
Assigned To Tobias Powalowski (tpowa)
Jan Alexander Steffens (heftig)
Architecture x86_64
Severity Critical
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 2
Private No

Details

I had prepared to report this to https://bugzilla.kernel.org. However, after reading "Distribution kernels" (https://www.kernel.org/category/releases.html),
if I understand it correctly, problems with any distribution kernel, whether a long-term kernel release or otherwise, should be reported to the distribution vendor.

As I do not know how or if the Arch Linux distributed kernels differ from the kernel.org releases, I decided to first report the problem here. Please advise whether I should report the problem directly to kernel.org.

Running the Arch Linux distribution, I have been observing kernel instability since upgrading from 4.15.9-1-ARCH to 4.17.2-1-ARCH.
Hoping the issue would resolve itself, I have waited for newer kernel releases and upgraded as they appeared. The problem however persists as of 4.17.8-1-ARCH.

The symptom
In a diskless node (workstation) configuration with 2 client computers, the server kernel panics and outputs the attached error messages.
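
To check a saved kernel log for the fault described above, a small script can search for the signature from the task title. This is a minimal sketch, not part of the original report: the function name `find_gpf_lines` is hypothetical, and the sample text is illustrative only (its fault line reuses the wording from the title of kernel.org Bug 199457, not a line from log.txt).

```python
import re

# Signature taken from the task title; the wording in a given log may differ.
GPF_PATTERN = re.compile(r"general protection fault: [0-9a-fA-F]{4}")

def find_gpf_lines(log_text):
    """Return (line_number, line) pairs for lines matching the fault signature."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if GPF_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits

# Illustrative input only; these are placeholder lines, not excerpts from log.txt.
sample = "\n".join([
    "unrelated line",
    "general protection fault: 0000 [#1] SMP PTI",
    "another unrelated line",
])

for number, line in find_gpf_lines(sample):
    print(number, line)
```

Running the same function over the output of the system journal on the server would show whether and where the fault was logged.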

The consequence
The client computers freeze after a short while, around a minute. The server however seems to be unaffected; that is, it does not crash, freeze, or otherwise become unresponsive.
Requesting `systemctl status nfs-server` reports that the nfs-server is running.
Rebooting a client without rebooting the server results in the client not being able to mount NFS4 again at boot. All computers must be shut down and then rebooted.

The cause
As far as I can tell, it is random. I cannot reproduce it on demand. The error could occur, for example,
* when a client computer is booting up,
* when one client computer is booted and running, and the other client computer is booting up,
* when using a client computer, for example, opening a web page.

The configuration
The server is more or less bare-bones: minimal installation, no GUI, and it performs no other major task.
On the clients, the main programs in use during a crash are Firefox, terminal (KDE), and text editors (KDE).
I can provide more information about the server computer's or client computers' configuration if necessary.

The version of NFS running is 4.2.

Client computers
* 4.17.4-1-ARCH
* nfs-utils 2.3.2-2
* `cat /proc/version` -> Linux version 4.17.4-1-ARCH (builduser@heftig-469) (gcc version 8.1.1 20180531 (GCC)) #1 SMP PREEMPT Tue Jul 3 15:45:09 UTC 2018

Server computer
* 4.17.8-1-ARCH
* nfs-utils 2.3.2-2
* `cat /proc/version` -> Linux version 4.17.8-1-ARCH (builduser@heftig-21239) (gcc version 8.1.1 20180531 (GCC)) #1 SMP PREEMPT Wed Jul 18 09:56:24 UTC 2018
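
The two `cat /proc/version` lines above can be compared programmatically, which helps when tracking which kernel each machine runs across upgrades. The following is a minimal sketch; the helper names `kernel_release` and `upstream_version` are hypothetical, and the regular expression assumes the line begins with "Linux version" as in the output quoted above.

```python
import re

def kernel_release(proc_version_line):
    """Extract the release string (e.g. '4.17.8-1-ARCH') from a /proc/version line."""
    match = re.match(r"Linux version (\S+)", proc_version_line)
    if match is None:
        raise ValueError("unrecognized /proc/version format")
    return match.group(1)

def upstream_version(release):
    """Turn a release such as '4.17.8-1-ARCH' into a tuple of ints for comparison."""
    numeric = release.split("-", 1)[0]
    return tuple(int(part) for part in numeric.split("."))

# The two lines quoted in this report.
client = ("Linux version 4.17.4-1-ARCH (builduser@heftig-469) "
          "(gcc version 8.1.1 20180531 (GCC)) #1 SMP PREEMPT Tue Jul 3 15:45:09 UTC 2018")
server = ("Linux version 4.17.8-1-ARCH (builduser@heftig-21239) "
          "(gcc version 8.1.1 20180531 (GCC)) #1 SMP PREEMPT Wed Jul 18 09:56:24 UTC 2018")

client_version = upstream_version(kernel_release(client))
server_version = upstream_version(kernel_release(server))
print("client:", client_version, "server:", server_version)
print("server is newer:", server_version > client_version)
```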

Other bug reports on bugzilla.kernel.org that may be related:
* Bug 200379 - kernel panic in NFSv4 server on high load (1000+/sec accesses from 3 clients)
https://bugzilla.kernel.org/show_bug.cgi?id=200379
* Bug 199457 - [NFS] general protection fault: 0000 [#1] SMP PTI
https://bugzilla.kernel.org/show_bug.cgi?id=199457

The only bug report on bugs.archlinux.org I found that may be related:
* FS#57474 - [linux] Kernel general protection fault when trying to use CIFS with 4.15.2-2-ARCH
https://bugs.archlinux.org/task/57474?string=&project=1&search_name=&type%5B0%5D=1&sev%5B0%5D=&pri%5B0%5D=&due%5B0%5D=&reported%5B0%5D=&cat%5B0%5D=12&status%5B0%5D=open&percent%5B0%5D=&opened=&dev=&closed=&duedatefrom=&duedateto=&changedfrom=&changedto=&openedfrom=&openedto=&closedfrom=&closedto=




   log.txt (37.1 KiB)

Closed by freswa (frederik)
Sunday, 13 September 2020, 14:05 GMT
Reason for closing: Fixed
Comment by loqs (loqs) - Monday, 23 July 2018, 09:53 GMT
You could rebuild the kernel without the four patches from https://git.archlinux.org/svntogit/packages.git/tree/trunk?h=packages/linux (three of which are due for future kernel inclusion).
Then you can report the issue upstream to the linux-nfs mailing list (http://vger.kernel.org/vger-lists.html#linux-nfs). Please note: use the mailing list, not the bug tracker, as not all kernel subsystems use the bug tracker.
Comment by John Doe (user832) - Wednesday, 25 July 2018, 19:02 GMT
> You could rebuild the kernel without the four patches from https://git.archlinux.org/svntogit/packages.git/tree/trunk?h=packages/linux (three of which are due for future kernel inclusion)

Thank you for your assistance.

Unfortunately, I have never gotten around to compiling the Linux kernel, although it is on my "to learn" list.

* https://wiki.archlinux.org/index.php/Kernels/Arch_Build_System
* https://wiki.archlinux.org/index.php/Kernels/Traditional_compilation

If I attempt that, which will take some time, I will then report the results upstream.

> Then you can report the issue upstream to the linux-nfs mailing list (http://vger.kernel.org/vger-lists.html#linux-nfs). Please note: use the mailing list, not the bug tracker, as not all kernel subsystems use the bug tracker.

Again, thank you.

I am unfamiliar with the kernel community. According to
* https://www.kernel.org/doc/html/v4.15/admin-guide/reporting-bugs.html#identify-who-to-notify

the alternatives are either Bugzilla or the mailing list obtained from the MAINTAINERS file:

* https://github.com/torvalds/linux/blob/master/MAINTAINERS

Bugzilla does however have the category "filesystems, NFS".

I presume the vger list corresponds to the MAINTAINERS file, no?

For future reference, how can I determine whether to use Bugzilla or the mailing list from the MAINTAINERS file (the vger list)?
Comment by loqs (loqs) - Wednesday, 25 July 2018, 19:42 GMT
I would suggest you do a bisection before contacting upstream (consider building 4.18-rc6 first to see if the issue is already fixed).
I do not know of any authoritative source for which subsystems prefer bug reports by mailing list rather than Bugzilla.
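
For readers unfamiliar with bisection, the idea can be sketched in a few lines. This is an illustration of the search logic only, not of `git bisect` itself: the function `first_bad`, the candidate names `c0` to `c7`, and the choice of `c5` as culprit are all hypothetical. It assumes the candidates are ordered oldest to newest, the first is known good, the last is known bad, and everything after the first bad candidate is also bad.

```python
def first_bad(candidates, is_bad):
    """Binary search for the first bad candidate.

    Assumes candidates[0] is good, candidates[-1] is bad, and that once a
    candidate is bad every later one is bad too.
    """
    lo, hi = 0, len(candidates) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            hi = mid  # fault present: the culprit is at mid or earlier
        else:
            lo = mid  # fault absent: the culprit is after mid
    return candidates[hi]

# Hypothetical history of eight builds in which "c5" introduced the fault.
candidates = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
tested = []

def is_bad(candidate):
    tested.append(candidate)  # record which builds had to be tested
    return candidates.index(candidate) >= 5

culprit = first_bad(candidates, is_bad)
print("first bad:", culprit, "after testing", tested)
```

Each step halves the remaining range, so only a handful of builds need testing. Because the fault here is reported as seemingly random, each "good" verdict would need a test run long enough to be trusted.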
Comment by John Doe (user832) - Tuesday, 21 August 2018, 22:50 GMT
I have updated the server kernel twice since my last post. The client computers are however still running 4.17.4-1-ARCH, and both the server and the client computers' nfs-utils are v2.3.2-2.

As I cannot reproduce the problem on demand, all I could do was use the computers as usual and see if the error would reappear.

4.17.11-arch1 (updated 180803)
The server worked for ~6 days before the error occurred again.

4.17.14-arch1-1-ARCH (updated 180811)
The server has worked without a problem since.

I am not sure whom to thank, or whether it may be too early to celebrate. I will keep updating the kernel regularly, but I think this bug report can be closed, as I cannot reproduce the problem and it seems to have been resolved by the newer update.

Thank you.
