FS#80323 : [mesa] radeonsi python-pytorch ROCm segfaults

Attached to Project: Arch Linux
Opened by c (grinness) - Wednesday, 22 November 2023, 09:50 GMT
Last edited by Buggy McBugFace (bugbot) - Saturday, 25 November 2023, 20:21 GMT

Task Type	Bug Report
Category	Packages: Extra
Status	Closed
Assigned To	Jan Alexander Steffens (heftig), Laurent Carlier (lordheavy), Felix Yan (felixonmars)
Architecture	All
Severity	Low
Priority	Normal
Reported Version
Due in Version	Undecided
Due Date	Undecided
Percent Complete
Votes	0
Private	No

Details

Description:

I still see exactly the same behavior as in ~~FS#79725~~ (now closed) with the updated python-pytorch-rocm package and the updated ROCm and HIP stack (5.7.1).
PyTorch applications segfault on gfx1030 (RX 6800).

Furthermore, running the following:

AMD_LOG_LEVEL=1 python
>>> import torch
>>> torch.cuda.current_device()

shows a bunch of errors regarding:

hipErrorNoBinaryForGpu: Unable to find code object for all current devices!
:1:hip_code_object.cpp :517 : 0976091129 us: [pid:24304 tid:0x7f34c722e740] Devices:
:1:hip_code_object.cpp :519 : 0976091133 us: [pid:24304 tid:0x7f34c722e740] amdgcn-amd-amdhsa--gfx1030 - [Not Found]
:1:hip_code_object.cpp :524 : 0976091135 us: [pid:24304 tid:0x7f34c722e740] Bundled Code Objects:
:1:hip_code_object.cpp :540 : 0976091138 us: [pid:24304 tid:0x7f34c722e740] host-x86_64-unknown-linux-- - [Unsupported]
:1:hip_code_object.cpp :537 : 0976091141 us: [pid:24304 tid:0x7f34c722e740] hipv4-amdgcn-amd-amdhsa--gfx906 - [code object targetID is amdgcn-amd-amdhsa--gfx906]

See attachment.
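
The log excerpt above can be summarized mechanically. The following is a minimal sketch, assuming the log lines keep the shape shown in this report (a target ID followed by ` - [status]`); the function name `summarize_hip_log` and the sample constant are made up for illustration and are not part of HIP or PyTorch.

```python
import re

# Matches the tail of lines such as
#   "... amdgcn-amd-amdhsa--gfx1030 - [Not Found]"
#   "... hipv4-amdgcn-amd-amdhsa--gfx906 - [code object targetID is ...]"
# Assumption: the line format is the one quoted in this report.
LINE_RE = re.compile(
    r"(?P<target>\S*amdgcn-amd-amdhsa--(?P<arch>gfx\w+))\s+-\s+\[(?P<status>[^\]]*)\]"
)

# Log lines copied from the report above.
SAMPLE_LOG = """\
:1:hip_code_object.cpp :517 : 0976091129 us: [pid:24304 tid:0x7f34c722e740] Devices:
:1:hip_code_object.cpp :519 : 0976091133 us: [pid:24304 tid:0x7f34c722e740] amdgcn-amd-amdhsa--gfx1030 - [Not Found]
:1:hip_code_object.cpp :524 : 0976091135 us: [pid:24304 tid:0x7f34c722e740] Bundled Code Objects:
:1:hip_code_object.cpp :540 : 0976091138 us: [pid:24304 tid:0x7f34c722e740] host-x86_64-unknown-linux-- - [Unsupported]
:1:hip_code_object.cpp :537 : 0976091141 us: [pid:24304 tid:0x7f34c722e740] hipv4-amdgcn-amd-amdhsa--gfx906 - [code object targetID is amdgcn-amd-amdhsa--gfx906]
"""


def summarize_hip_log(text):
    """Return (device archs with no code object, archs bundled in the binary)."""
    missing, bundled = set(), set()
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue  # header lines and the host entry carry no GPU arch
        if match.group("status") == "Not Found":
            missing.add(match.group("arch"))
        elif match.group("target").startswith("hipv4-"):
            bundled.add(match.group("arch"))
    return missing, bundled


if __name__ == "__main__":
    missing, bundled = summarize_hip_log(SAMPLE_LOG)
    print("device archs without a code object:", sorted(missing))
    print("archs bundled in the binary:", sorted(bundled))
```

Run against the full attachment, this kind of summary would show at a glance whether any bundled code object targets the installed GPU.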

Note that compiling from source with magma disabled (USE_MAGMA=OFF) solves the problem.
For reference, I also attach the PKGBUILD that works; it is the same PKGBUILD posted in ~~FS#79725~~.

pytorch-rocm-hip-errors.txt (904.4 KiB)

PKGBUILD (1.4 KiB)

Closed by Buggy McBugFace (bugbot)
Saturday, 25 November 2023, 20:21 GMT
Reason for closing: Moved
Additional comments about closing: https://gitlab.archlinux.org/archlinux/packaging/packages/mesa/issues/3

Comment by Toolybird (Toolybird) - Wednesday, 22 November 2023, 19:40 GMT

@grinness, you should know by now that if reporting segfault crashes, you *must* provide a backtrace. And it *must* be a backtrace that includes debugging information via debuginfod. For example, the backtrace you posted in ~~FS#80301~~ is missing debug symbols. If you don't see source code line numbers in the trace then it's essentially useless.

Ensure gdb is installed then:

$ coredumpctl gdb (then answer y when it asks "Enable debuginfod for this session?")
(gdb) set logging enabled
(gdb) bt (or bt full)

Then post gdb.txt

More reading at [1][2]

[1] https://blogs.gnome.org/mcatanzaro/2021/09/18/creating-quality-backtraces-for-crash-reports/
[2] https://wiki.archlinux.org/title/Debugging/Getting_traces
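
The "source code line numbers" criterion above can be checked before posting. This is a rough sketch, assuming plain `bt` output saved to gdb.txt, where frames start with `#<n>` and resolved frames contain ` at <file>:<line>`; the helper name `frames_with_source` and the sample backtrace are invented for illustration and are not taken from any attachment in this task.

```python
import re

# A gdb frame line starts with "#<n> "; a frame resolved with debug info
# contains " at <file>:<line>". Unresolved frames typically show only
# "from <library>" or "?? ()". This is a heuristic, not a gdb parser.
FRAME_RE = re.compile(r"^#\d+\s")
SOURCE_RE = re.compile(r"\sat\s\S+:\d+")

# Hypothetical backtrace, for illustration only.
SAMPLE_BT = """\
#0  0x00007ffff7e8a83c in ?? () from /usr/lib/libc.so.6
#1  0x00007fffe1234567 in example_function (ptr=0x0) at ../src/example.c:42
#2  0x00007fffe1239999 in another_function ()
   from /usr/lib/libexample.so
"""


def frames_with_source(backtrace):
    """Return (frames with file:line info, total frames) for gdb `bt` output."""
    frames = []
    for line in backtrace.splitlines():
        if FRAME_RE.match(line):
            frames.append(line)
        elif frames:
            # gdb wraps long frames; fold continuation lines into the frame.
            frames[-1] += " " + line.strip()
    resolved = sum(1 for frame in frames if SOURCE_RE.search(frame))
    return resolved, len(frames)


if __name__ == "__main__":
    resolved, total = frames_with_source(SAMPLE_BT)
    print(f"{resolved} of {total} frames carry source line numbers")
```

A trace in which few or no frames resolve suggests that debuginfod was not enabled for the gdb session.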

Comment by c (grinness) - Wednesday, 22 November 2023, 20:39 GMT

@Toolybird (Toolybird)

Apologies. I ran a sample Python script that trains a neural network under gdb and found that the segmentation fault is not in pytorch-rocm; it is actually caused by a call to matplotlib (with the relevant code commented out, there is no segmentation fault).
I attach the gdb output regardless -- the debug info seems to point to unaligned memory in radeonsi

Note that the warnings about amdgcn-amd-amdhsa--gfx1030 - [Not Found] are present when running the sample code provided in my first post.

If you and the maintainer want I can close this and open a new one with the correct title.

gdb-python-minst-nn-out.txt (41.7 KiB)

Comment by Toolybird (Toolybird) - Wednesday, 22 November 2023, 21:05 GMT

> the debug info seems to point to unaligned memory in radeonsi

Ok, thanks for that. Therefore it seems like an upstream bug in mesa. You should probably report this crash upstream but I will firstly reassign this ticket to the mesa PM's for a look-see.
