FS#69740 - [linux-lts] nouveau_gem_object_close hanging on 5.10 lts

Attached to Project: Arch Linux
Opened by Richard PALO (risto3) - Tuesday, 23 February 2021, 08:28 GMT
Last edited by Andreas Radke (AndyRTR) - Wednesday, 01 September 2021, 09:13 GMT
Task Type Bug Report
Category Upstream Bugs
Status Closed
Assigned To Andreas Radke (AndyRTR)
Architecture x86_64
Severity High
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 1
Private No

Details

Description:

Since upgrading from 5.4 lts to 5.10 lts, I'm experiencing frequent serious stability problems.
After the screen locks up, I have no recourse but to reboot.

Finally, I booted up an old laptop to ssh into the machine and
noticed the following in the kernel log:
févr. 22 18:45:33 sarchx64 kernel: general protection fault, probably for non-canonical address 0xeb6f95a0d500b36f: 0000 [#1] SMP NOPTI
févr. 22 18:45:33 sarchx64 kernel: CPU: 9 PID: 6240 Comm: Xorg Tainted: P OE 5.10.17-1-lts #1
févr. 22 18:45:33 sarchx64 kernel: Hardware name: Supermicro H8SGL/H8SGL, BIOS 3.5b 03/18/2016
févr. 22 18:45:33 sarchx64 kernel: RIP: 0010:kmem_cache_alloc_trace+0xdb/0x270
févr. 22 18:45:33 sarchx64 kernel: Code: 05 e2 bd 76 6c 49 8b 00 49 83 78 10 00 48 89 04 24 0f 84 57 01 00 00 48 85 c0 0f 84 4e 01 00 00 8b 4d 28 48 8b 7d 00 48 01 c1 <48> 8b 19 48 89 ce 48 33 9d b8 00 00 00 48 0f ce 48 31 f3 40 f6 c7
févr. 22 18:45:33 sarchx64 kernel: RSP: 0018:ffffb31e22d23cd0 EFLAGS: 00010286
févr. 22 18:45:33 sarchx64 kernel: RAX: eb6f95a0d500b33f RBX: 0000000000000cc0 RCX: eb6f95a0d500b36f
févr. 22 18:45:33 sarchx64 kernel: RDX: 000000000003a2c5 RSI: 0000000000000cc0 RDI: 00000000000300c0
févr. 22 18:45:33 sarchx64 kernel: RBP: ffff9ede40043a00 R08: ffff9ee55fcf00c0 R09: 0000000000000000
févr. 22 18:45:33 sarchx64 kernel: R10: ffff9ee1d49c84b8 R11: ffffb31e22d23d40 R12: 0000000000000cc0
févr. 22 18:45:33 sarchx64 kernel: R13: 0000000000000048 R14: ffffffffc147a629 R15: 0000000000000000
févr. 22 18:45:33 sarchx64 kernel: FS: 00007efc19f45940(0000) GS:ffff9ee55fcc0000(0000) knlGS:0000000000000000
févr. 22 18:45:33 sarchx64 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
févr. 22 18:45:33 sarchx64 kernel: CR2: 00007fd79b9fa000 CR3: 00000001448b4000 CR4: 00000000000406e0
févr. 22 18:45:33 sarchx64 kernel: Call Trace:
févr. 22 18:45:33 sarchx64 kernel: nouveau_gem_object_close+0x119/0x1f0 [nouveau]
févr. 22 18:45:33 sarchx64 kernel: drm_gem_object_release_handle+0x30/0x90 [drm]
févr. 22 18:45:33 sarchx64 kernel: drm_gem_handle_delete+0x58/0x90 [drm]
févr. 22 18:45:33 sarchx64 kernel: ? drm_gem_handle_create+0x40/0x40 [drm]
févr. 22 18:45:33 sarchx64 kernel: drm_ioctl_kernel+0xb2/0x100 [drm]
févr. 22 18:45:33 sarchx64 kernel: drm_ioctl+0x215/0x390 [drm]
févr. 22 18:45:33 sarchx64 kernel: ? drm_gem_handle_create+0x40/0x40 [drm]
févr. 22 18:45:33 sarchx64 kernel: nouveau_drm_ioctl+0x55/0xa0 [nouveau]
févr. 22 18:45:33 sarchx64 kernel: __x64_sys_ioctl+0x83/0xb0
févr. 22 18:45:33 sarchx64 kernel: do_syscall_64+0x33/0x40
févr. 22 18:45:33 sarchx64 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
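
A side note on the first line of the trace: on x86-64 with 4-level paging, a virtual address is canonical only if bits 63 through 47 are all equal. The sketch below checks the faulting address from the log above. The helper name is_canonical is hypothetical, and 48-bit virtual addresses are an assumption. It also shows that the faulting address is RAX plus 0x30, which is consistent with (but does not prove) a corrupted pointer being dereferenced at a fixed offset.

```python
VA_BITS = 48  # assumption: 4-level paging (48-bit virtual addresses)


def is_canonical(addr: int, va_bits: int = VA_BITS) -> bool:
    """Return True if bits 63..(va_bits - 1) of a 64-bit address are all equal."""
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


fault_addr = 0xeb6f95a0d500b36f  # from the "general protection fault" line
rax = 0xeb6f95a0d500b33f         # RAX in the register dump
rbp = 0xffff9ede40043a00         # RBP, an ordinary-looking kernel pointer

print(is_canonical(fault_addr))  # False
print(is_canonical(rbp))         # True
print(hex(fault_addr - rax))     # 0x30
```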

The system is not reliable enough now to stay in production!

It was working great under 5.4.

Additional info:
* package version(s)
* config and/or log files etc.
* link to upstream bug report, if any

Mainboard is a Supermicro H8SGL/H8SGL, BIOS 3.5b 03/18/2016
with an AMD Opteron(tm) Processor 6338P
and a NVIDIA Corporation GK208B [GeForce GT 730] (rev a1)

Linux sarchx64 5.10.17-1-lts #1 SMP Wed, 17 Feb 2021 11:11:31 +0000 x86_64 GNU/Linux

nouveau-fw 340.32-1
xf86-video-nouveau 1.0.17-1
xorg-server 1.20.10-3


Steps to reproduce:
upgrade from 5.4 lts to latest 5.10

I seem to always get hit when using firefox, after about 15-30 minutes.
It happens frequently, but I can't say I can reproduce it with a specific series of keystrokes.

Closed by Andreas Radke (AndyRTR)
Wednesday, 01 September 2021, 09:13 GMT
Reason for closing: Upstream
Comment by Richard PALO (risto3) - Tuesday, 23 February 2021, 09:01 GMT
Looking further back, I notice that it is not only nouveau, but also the built-in mga controller (not used):
févr. 20 08:58:57 sarchx64 kernel: WARNING: CPU: 8 PID: 8166 at drivers/gpu/drm/drm_gem_shmem_helper.c:197 drm_gem_shmem_put_pages_locked+0x4a/0x50 [drm]
févr. 20 08:58:57 sarchx64 kernel: Modules linked in: ip_set bonding nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_limit nft_counter nft_ct nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables libcrc32c nfnetlink_log nfne>
févr. 20 08:58:57 sarchx64 kernel: vboxdrv(OE) usbip_host usbip_core ipmi_si nfsd ipmi_devintf ipmi_msghandler auth_rpcgss sg nfs_acl lockd drm crypto_user grace acpi_call(OE) sunrpc nfs_ssc agpgart fuse bpf_preload ip_tables x_tables >
févr. 20 08:58:57 sarchx64 kernel: CPU: 8 PID: 8166 Comm: Xorg Tainted: P OE 5.10.17-1-lts #1
févr. 20 08:58:57 sarchx64 kernel: Hardware name: Supermicro H8SGL/H8SGL, BIOS 3.5b 03/18/2016
févr. 20 08:58:57 sarchx64 kernel: RIP: 0010:drm_gem_shmem_put_pages_locked+0x4a/0x50 [drm]
févr. 20 08:58:57 sarchx64 kernel: Code: 0f b6 97 88 01 00 00 48 8b b7 68 01 00 00 89 d1 83 e2 01 d0 e9 83 e1 01 e8 93 2f fd ff 48 c7 83 68 01 00 00 00 00 00 00 5b c3 <0f> 0b c3 0f 1f 00 0f 1f 44 00 00 41 54 4c 8d a7 48 01 00 00 55 48
févr. 20 08:58:57 sarchx64 kernel: RSP: 0018:ffff98fac3b73ba0 EFLAGS: 00010246
févr. 20 08:58:57 sarchx64 kernel: RAX: 0000000000000000 RBX: ffff8b22d0396000 RCX: 0000000000000000
févr. 20 08:58:57 sarchx64 kernel: RDX: ffff8b2301a7db80 RSI: d923af1586e9727c RDI: ffff8b25fa201200
févr. 20 08:58:57 sarchx64 kernel: RBP: ffff8b25fa201200 R08: ffffffff98a2e490 R09: ffff98faedee8000
févr. 20 08:58:57 sarchx64 kernel: R10: ffff98faedbe7000 R11: ffff8b2651f39940 R12: ffff8b25fa201398
févr. 20 08:58:57 sarchx64 kernel: R13: ffff8b25fa201348 R14: ffff8b262ec98800 R15: ffff8b25f2040108
févr. 20 08:58:57 sarchx64 kernel: FS: 00007f8f4b795940(0000) GS:ffff8b29dfc80000(0000) knlGS:0000000000000000
févr. 20 08:58:57 sarchx64 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
févr. 20 08:58:57 sarchx64 kernel: CR2: 000055d3101769a0 CR3: 00000001326f8000 CR4: 00000000000406e0
févr. 20 08:58:57 sarchx64 kernel: Call Trace:
févr. 20 08:58:57 sarchx64 kernel: drm_gem_shmem_vunmap+0x6e/0xa0 [drm]
févr. 20 08:58:57 sarchx64 kernel: mgag200_handle_damage+0x50/0x220 [mgag200]
févr. 20 08:58:57 sarchx64 kernel: mgag200_simple_display_pipe_update+0x7c/0x90 [mgag200]
févr. 20 08:58:57 sarchx64 kernel: drm_atomic_helper_commit_planes+0xb8/0x220 [drm_kms_helper]
févr. 20 08:58:57 sarchx64 kernel: drm_atomic_helper_commit_tail+0x42/0x80 [drm_kms_helper]
févr. 20 08:58:57 sarchx64 kernel: commit_tail+0xce/0x130 [drm_kms_helper]
févr. 20 08:58:57 sarchx64 kernel: drm_atomic_helper_commit+0x113/0x140 [drm_kms_helper]
févr. 20 08:58:57 sarchx64 kernel: drm_atomic_helper_set_config+0x70/0xb0 [drm_kms_helper]
févr. 20 08:58:57 sarchx64 kernel: drm_mode_setcrtc+0x233/0x770 [drm]
févr. 20 08:58:57 sarchx64 kernel: ? drm_mode_getcrtc+0x180/0x180 [drm]
févr. 20 08:58:57 sarchx64 kernel: drm_ioctl_kernel+0xb2/0x100 [drm]
févr. 20 08:58:57 sarchx64 kernel: drm_ioctl+0x215/0x390 [drm]
févr. 20 08:58:57 sarchx64 kernel: ? drm_mode_getcrtc+0x180/0x180 [drm]
févr. 20 08:58:57 sarchx64 kernel: __x64_sys_ioctl+0x83/0xb0
févr. 20 08:58:57 sarchx64 kernel: do_syscall_64+0x33/0x40
févr. 20 08:58:57 sarchx64 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
févr. 20 08:58:57 sarchx64 kernel: RIP: 0033:0x7f8f4b5ade6b
févr. 20 08:58:57 sarchx64 kernel: Code: ff ff ff 85 c0 79 8b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d d5 af 0c 00 f7 d8 64 89 01 48
févr. 20 08:58:57 sarchx64 kernel: RSP: 002b:00007fff9004aba8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
févr. 20 08:58:57 sarchx64 kernel: RAX: ffffffffffffffda RBX: 00007fff9004abe0 RCX: 00007f8f4b5ade6b
févr. 20 08:58:57 sarchx64 kernel: RDX: 00007fff9004abe0 RSI: 00000000c06864a2 RDI: 0000000000000011
févr. 20 08:58:57 sarchx64 kernel: RBP: 00000000c06864a2 R08: 0000000000000000 R09: 000055d310987990
févr. 20 08:58:57 sarchx64 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 000055d3101761e0
févr. 20 08:58:57 sarchx64 kernel: R13: 0000000000000011 R14: 000055d3100d4218 R15: 0000000000000001
févr. 20 08:58:57 sarchx64 kernel: ---[ end trace de0a0379cbc23111 ]---
Comment by Andreas Radke (AndyRTR) - Tuesday, 23 February 2021, 09:08 GMT
There's probably not much we can do at distro level. Please get in touch with upstream devs.
Comment by Andreas Radke (AndyRTR) - Tuesday, 23 February 2021, 09:11 GMT
Not sure if your card actually requires the firmware. If so feel free to update the firmware package.
Comment by Richard PALO (risto3) - Tuesday, 23 February 2021, 16:11 GMT
I checked via https://github.com/envytools/firmware which seems to have a somewhat more up to date extract_firmware.py
There is no difference from the firmware blob I had before...
In any event, as I saw problems with the builtin mga controller on my mainboard, I'm not sure it is necessarily 'nouveau'
at fault... perhaps the problem is with the drm helpers.
It's been a long while since I've solicited the xorg devs; where is the current appropriate upstream site to file an issue?
Comment by Andreas Radke (AndyRTR) - Tuesday, 23 February 2021, 16:33 GMT
Comment by Askhat Bakarov (sirocco) - Thursday, 11 March 2021, 09:22 GMT
Similar problem after update from 5.4.98 lts to 5.10.22 lts

02:00.0 VGA compatible controller: NVIDIA Corporation G84 [GeForce 8600 GT] (rev a1)

So far I have not seen problems with linux 5.11.4


мар 11 16:04:54 cehost kernel: general protection fault, probably for non-canonical address 0xf4ec2f5f5dd68acf: 0000 [#1] SMP NOPTI
мар 11 16:04:54 cehost kernel: CPU: 1 PID: 370 Comm: Xorg Tainted: G I 5.10.22-2-lts #1
мар 11 16:04:54 cehost kernel: Hardware name: System manufacturer System Product Name/M2NPV-VM, BIOS ASUS M2NPV-VM ACPI BIOS Revision 5005 06/02/2010
мар 11 16:04:54 cehost kernel: RIP: 0010:kmem_cache_alloc_trace+0xdb/0x270
мар 11 16:04:54 cehost kernel: Code: 05 c2 a7 36 7c 49 8b 00 49 83 78 10 00 48 89 04 24 0f 84 57 01 00 00 48 85 c0 0f 84 4e 01 00 00 8b 4d 28 48 8b 7d 00 48 01 c1 <48> 8b 19 48 89 ce 48 33 9d b8 00 00 00 48 0f ce 48 31 f3 40 f6 c7
мар 11 16:04:54 cehost kernel: RSP: 0018:ffffaf96c06bfc20 EFLAGS: 00010286
мар 11 16:04:54 cehost kernel: RAX: f4ec2f5f5dd68a9f RBX: 0000000000000dc0 RCX: f4ec2f5f5dd68acf
мар 11 16:04:54 cehost kernel: RDX: 000000000001e2ea RSI: 0000000000000dc0 RDI: 00000000000300c0
мар 11 16:04:54 cehost kernel: RBP: ffff9d0f00042a00 R08: ffff9d0f2bcb00c0 R09: ffff9d0f00a000b8
мар 11 16:04:54 cehost kernel: R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000dc0
мар 11 16:04:54 cehost kernel: R13: 0000000000000060 R14: ffffffffc041bb63 R15: 0000000000000000
мар 11 16:04:54 cehost kernel: FS: 00007f40a5af9940(0000) GS:ffff9d0f2bc80000(0000) knlGS:0000000000000000
мар 11 16:04:54 cehost kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
мар 11 16:04:54 cehost kernel: CR2: 00007f4098859000 CR3: 000000010d000000 CR4: 00000000000006e0
мар 11 16:04:54 cehost kernel: Call Trace:
мар 11 16:04:54 cehost kernel: nouveau_fence_new+0x33/0xb0 [nouveau]
мар 11 16:04:54 cehost kernel: nouveau_gem_ioctl_pushbuf+0xa5d/0x11b0 [nouveau]
мар 11 16:04:54 cehost kernel: ? nouveau_gem_ioctl_new+0xf0/0xf0 [nouveau]
мар 11 16:04:54 cehost kernel: drm_ioctl_kernel+0xb2/0x100 [drm]
мар 11 16:04:54 cehost kernel: drm_ioctl+0x215/0x390 [drm]
мар 11 16:04:54 cehost kernel: ? nouveau_gem_ioctl_new+0xf0/0xf0 [nouveau]
мар 11 16:04:54 cehost kernel: nouveau_drm_ioctl+0x55/0xa0 [nouveau]
мар 11 16:04:54 cehost kernel: __x64_sys_ioctl+0x83/0xb0
мар 11 16:04:54 cehost kernel: do_syscall_64+0x33/0x40
мар 11 16:04:54 cehost kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
мар 11 16:04:54 cehost kernel: RIP: 0033:0x7f40a64e1e6b
мар 11 16:04:54 cehost kernel: Code: ff ff ff 85 c0 79 8b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d d5 af 0c 00 f7 d8 64 89 01 48
мар 11 16:04:54 cehost kernel: RSP: 002b:00007ffdce3a0a28 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
мар 11 16:04:54 cehost kernel: RAX: ffffffffffffffda RBX: 00007ffdce3a0a90 RCX: 00007f40a64e1e6b
мар 11 16:04:54 cehost kernel: RDX: 00007ffdce3a0a90 RSI: 00000000c0406481 RDI: 000000000000000e
мар 11 16:04:54 cehost kernel: RBP: 00000000c0406481 R08: 000055d7ba253880 R09: 000055d7ba26cb78
мар 11 16:04:54 cehost kernel: R10: 000055d7ba7bd510 R11: 0000000000000246 R12: 000055d7ba25bb70
мар 11 16:04:54 cehost kernel: R13: 000000000000000e R14: 000055d7ba250d00 R15: 000055d7ba253880
мар 11 16:04:54 cehost kernel: Modules linked in: vfat fat xfs libcrc32c snd_hda_codec_analog snd_hda_codec_generic ledtrig_audio cfg80211 rfkill 8021q garp mrp stp llc hwmon_vid snd_hda_intel snd_intel_dspcfg soundwire_intel soundwire_generic_allocation soundwire_cadence snd_hda_codec snd_hda_core snd_hwdep soundwire_bus edac_mce_amd kvm_amd snd_soc_core ccp rng_core kvm snd_compress ppdev ac97_bus snd_pcm_dmaengine snd_pcm irqbypass snd_timer pcspkr k10temp mousedev snd soundcore forcedeth nv_tco i2c_nforce2 parport_pc parport asus_atk0110 mac_hid acpi_cpufreq sr_mod cdrom usbip_host usbip_core sg crypto_user fuse bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 usb_storage usbhid ata_generic pata_acpi firewire_ohci firewire_core serio_raw crc_itu_t sata_nv pata_amd xhci_pci xhci_pci_renesas nouveau video i2c_algo_bit mxm_wmi wmi ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec drm agpgart
мар 11 16:04:54 cehost kernel: ---[ end trace b2150b7062ac0ba1 ]---
мар 11 16:04:54 cehost kernel: RIP: 0010:kmem_cache_alloc_trace+0xdb/0x270
мар 11 16:04:54 cehost kernel: Code: 05 c2 a7 36 7c 49 8b 00 49 83 78 10 00 48 89 04 24 0f 84 57 01 00 00 48 85 c0 0f 84 4e 01 00 00 8b 4d 28 48 8b 7d 00 48 01 c1 <48> 8b 19 48 89 ce 48 33 9d b8 00 00 00 48 0f ce 48 31 f3 40 f6 c7
мар 11 16:04:54 cehost kernel: RSP: 0018:ffffaf96c06bfc20 EFLAGS: 00010286
мар 11 16:04:54 cehost kernel: RAX: f4ec2f5f5dd68a9f RBX: 0000000000000dc0 RCX: f4ec2f5f5dd68acf
мар 11 16:04:54 cehost kernel: RDX: 000000000001e2ea RSI: 0000000000000dc0 RDI: 00000000000300c0
мар 11 16:04:54 cehost kernel: RBP: ffff9d0f00042a00 R08: ffff9d0f2bcb00c0 R09: ffff9d0f00a000b8
мар 11 16:04:54 cehost kernel: R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000dc0
мар 11 16:04:54 cehost kernel: R13: 0000000000000060 R14: ffffffffc041bb63 R15: 0000000000000000
мар 11 16:04:54 cehost kernel: FS: 00007f40a5af9940(0000) GS:ffff9d0f2bc80000(0000) knlGS:0000000000000000
мар 11 16:04:54 cehost kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
мар 11 16:04:54 cehost kernel: CR2: 00007f4098859000 CR3: 000000010d000000 CR4: 00000000000006e0
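
To compare this trace with the one in the original report, it can help to reduce each oops to its faulting symbol and its reliable call-trace frames. The following is only a sketch: the function name summarize_oops and both regular expressions are assumptions tailored to the journal format pasted in this task, not a general oops parser.

```python
import re

# Frames look like "nouveau_fence_new+0x33/0xb0 [nouveau]"; a leading "? "
# marks a frame the kernel unwinder considers unreliable.
FRAME_RE = re.compile(
    r"kernel: (\? )?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?$"
)
# Kernel-mode RIP lines use code segment 0010 in the logs above.
RIP_RE = re.compile(r"kernel: RIP: 0010:(\S+)")


def summarize_oops(log_text):
    """Return (faulting symbol, [(function, module or None), ...])."""
    rip = None
    frames = []
    in_trace = False
    for line in log_text.splitlines():
        line = line.rstrip()
        match = RIP_RE.search(line)
        if match and rip is None:
            rip = match.group(1)
            continue
        if line.endswith("kernel: Call Trace:"):
            in_trace = True
            continue
        if in_trace:
            match = FRAME_RE.search(line)
            if match is None:
                in_trace = False  # first non-frame line ends the trace
                continue
            unreliable, func, module = match.groups()
            if not unreliable:
                frames.append((func, module))
    return rip, frames
```

Run over the two nouveau traces in this task, the first element of the result should be kmem_cache_alloc_trace+0xdb/0x270 in both cases, while the call paths differ (nouveau_gem_object_close in the original report, nouveau_fence_new here).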