Please read this before reporting a bug:
https://wiki.archlinux.org/title/Bug_reporting_guidelines
Do NOT report bugs when a package is just outdated, or it is in the AUR. Use the 'flag out of date' link on the package page, or the Mailing List.
REPEAT: Do NOT report bugs for outdated packages!
FS#80323 - [mesa] radeonsi python-pytorch ROCm segfaults
Attached to Project:
Arch Linux
Opened by c (grinness) - Wednesday, 22 November 2023, 09:50 GMT
Last edited by Buggy McBugFace (bugbot) - Saturday, 25 November 2023, 20:21 GMT
Details
Description:
I experience exactly the same behavior as in "pytorch applications segfaults on gfx 1030 (rx6800)".

Furthermore, running the below:

AMD_LOG_LEVEL=1 python
>>> import torch
>>> torch.cuda.current_device()

shows a bunch of errors regarding:

hipErrorNoBinaryForGpu: Unable to find code object for all current devices!
:1:hip_code_object.cpp :517 : 0976091129 us: [pid:24304 tid:0x7f34c722e740] Devices:
:1:hip_code_object.cpp :519 : 0976091133 us: [pid:24304 tid:0x7f34c722e740] amdgcn-amd-amdhsa--gfx1030 - [Not Found]
:1:hip_code_object.cpp :524 : 0976091135 us: [pid:24304 tid:0x7f34c722e740] Bundled Code Objects:
:1:hip_code_object.cpp :540 : 0976091138 us: [pid:24304 tid:0x7f34c722e740] host-x86_64-unknown-linux-- - [Unsupported]
:1:hip_code_object.cpp :537 : 0976091141 us: [pid:24304 tid:0x7f34c722e740] hipv4-amdgcn-amd-amdhsa--gfx906 - [code object targetID is amdgcn-amd-amdhsa--gfx906]

See attachment.

Note that compiling from source and disabling magma (USE_MAGMA=OFF) solves the problem.
I also attach the PKGBUILD that works for reference -- same PKGBUILD posted in
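
The mismatch in the log above (device gfx1030, bundled code object only for gfx906) can be pulled out mechanically. The following is a minimal sketch, not part of the original report: the helper name `summarize_hip_log` is hypothetical, and it assumes the AMD_LOG_LEVEL=1 output keeps the "Devices:" / "Bundled Code Objects:" layout shown in the excerpt.

```python
import re

# Log excerpt copied from the report above.
HIP_LOG = """\
:1:hip_code_object.cpp :517 : 0976091129 us: [pid:24304 tid:0x7f34c722e740] Devices:
:1:hip_code_object.cpp :519 : 0976091133 us: [pid:24304 tid:0x7f34c722e740] amdgcn-amd-amdhsa--gfx1030 - [Not Found]
:1:hip_code_object.cpp :524 : 0976091135 us: [pid:24304 tid:0x7f34c722e740] Bundled Code Objects:
:1:hip_code_object.cpp :540 : 0976091138 us: [pid:24304 tid:0x7f34c722e740] host-x86_64-unknown-linux-- - [Unsupported]
:1:hip_code_object.cpp :537 : 0976091141 us: [pid:24304 tid:0x7f34c722e740] hipv4-amdgcn-amd-amdhsa--gfx906 - [code object targetID is amdgcn-amd-amdhsa--gfx906]
"""

GFX = re.compile(r"gfx[0-9a-f]+")


def summarize_hip_log(text):
    """Return (device_targets, bundled_targets) as sets of gfx names.

    Hypothetical helper: assumes the section headers seen in the excerpt.
    """
    devices, bundled = set(), set()
    section = None
    for line in text.splitlines():
        # Drop everything up to the "[pid:... tid:...]" prefix.
        message = line.split("]", 1)[-1].strip()
        if message == "Devices:":
            section = devices
        elif message == "Bundled Code Objects:":
            section = bundled
        elif section is not None:
            # Look only at the entry name, not the bracketed status suffix.
            match = GFX.search(message.split(" - [", 1)[0])
            if match:
                section.add(match.group())
    return devices, bundled


devices, bundled = summarize_hip_log(HIP_LOG)
# Targets present on the machine but absent from the bundled code objects.
print("missing targets:", sorted(devices - bundled))
```

A non-empty "missing targets" list matches the hipErrorNoBinaryForGpu message: the failing binary in this excerpt carries no code object for the installed GPU.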
Closed by Buggy McBugFace (bugbot)
Saturday, 25 November 2023, 20:21 GMT
Reason for closing: Moved
Additional comments about closing: https://gitlab.archlinux.org/archlinux/packaging/packages/mesa/issues/3
pytorch-rocm-hip-errors.txt
FS#80301 is missing debug symbols. If you don't see source code line numbers in the trace then it's essentially useless. Ensure gdb is installed then:
$ coredumpctl gdb (then answer y when it asks "Enable debuginfod for this session?")
(gdb) set logging enabled
(gdb) bt (or bt full)
Then post gdb.txt
More reading at [1][2]
[1] https://blogs.gnome.org/mcatanzaro/2021/09/18/creating-quality-backtraces-for-crash-reports/
[2] https://wiki.archlinux.org/title/Debugging/Getting_traces
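
Before posting gdb.txt, it may help to check whether the trace actually contains source line numbers. The following is a rough sketch, not an official tool: the helper name `frames_with_source` and the two sample frames are made up for illustration, and it assumes gdb's usual frame layout, where a resolved frame ends in "at <file>:<line>".

```python
import re

# A gdb frame line starts with "#<number>" followed by whitespace.
FRAME = re.compile(r"^#\d+\s")
# A frame with debug info ends in "at <file>:<line>".
SOURCE_LOCATION = re.compile(r"\sat\s\S+:\d+$")


def frames_with_source(backtrace):
    """Return (resolved, total) frame counts for a gdb backtrace text."""
    frames = [line.rstrip() for line in backtrace.splitlines() if FRAME.match(line)]
    resolved = [frame for frame in frames if SOURCE_LOCATION.search(frame)]
    return len(resolved), len(frames)


# Hypothetical frames, written only to exercise the check.
SAMPLE = """\
#0  0x00007f0000001234 in example_function () from /usr/lib/libexample.so
#1  example_caller (arg=1) at example.c:42
"""

resolved, total = frames_with_source(SAMPLE)
print(f"{resolved} of {total} frames have source line numbers")
```

To run it on a real trace, pass the contents of gdb.txt instead of SAMPLE. If few or no frames resolve, debug symbols were probably not loaded.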
Apologies, I have run sample Python code that trains a neural network under gdb and found that the segmentation fault is not in pytorch-rocm; it is actually caused by a call to matplotlib (with the relevant code commented out, there is no segmentation fault).
I attach the gdb output regardless -- the debug info seems to point to unaligned memory in radeonsi.
Note that the warnings about amdgcn-amd-amdhsa--gfx1030 - [Not Found] are present when running the sample code provided in my first post.
If you and the maintainer want I can close this and open a new one with the correct title.
Ok, thanks for that. It therefore seems like an upstream bug in mesa. You should probably report this crash upstream, but I will first reassign this ticket to the mesa PMs for a look-see.