FS#39203 - [nvidia] CUDA (and OpenCL) not working with nvidia 334.21; works with 331.38

Attached to Project: Arch Linux
Opened by Ochi (ochi) - Thursday, 06 March 2014, 18:26 GMT
Last edited by Felix Yan (felixonmars) - Saturday, 29 March 2014, 02:41 GMT
Task Type Bug Report
Category Packages: Extra
Status Closed
Assigned To Ionut Biru (wonder)
Sven-Hendrik Haase (Svenstaro)
Felix Yan (felixonmars)
Architecture x86_64
Severity Medium
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 9
Private No

Details

Description:

Neither CUDA nor OpenCL applications seem to be able to run with nvidia 334.21, but they are working with 331.38 (and earlier versions). Using CUDA 5.0 or 5.5 does not seem to make a difference.

Steps to reproduce:

For testing CUDA, try running e.g. the "vectorAdd" example from the cuda package. The result using nvidia 334.21 is:

>./vectorAdd
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code unknown error)!

For testing OpenCL, you may use this minimal program that tries to get the available platforms: http://pastebin.com/1x9gpXMf
Result with 331.38 is Count = 1, Error = 0. Result with 334.21 is Count = 0, Error = -1001 (CL_PLATFORM_NOT_FOUND_KHR).
This task depends upon

Closed by  Felix Yan (felixonmars)
Saturday, 29 March 2014, 02:41 GMT
Reason for closing:  Fixed
Additional comments about closing:  Added nvidia-modprobe to nvidia-utils as a workaround. Real fix should be on nvidia side.
Comment by Doug Newgard (Scimmia) - Thursday, 06 March 2014, 18:36 GMT
You didn't give the specific package version. If you're on 334.21-2, OK, but if you're on -1, this is a duplicate and has already been fixed.
Comment by Ochi (ochi) - Thursday, 06 March 2014, 19:41 GMT
Hello,

the installed package versions are:

nvidia 334.21-2 <- note pkgrel 2
nvidia-utils 334.21-1
nvidia-libgl 334.21-1
opencl-nvidia 334.21-1
opencl-headers 2:1.1.20110526-1
libcl 1.1-3
cuda 5.5.22-1
Comment by Sven-Hendrik Haase (Svenstaro) - Sunday, 09 March 2014, 01:02 GMT
Please test cuda from [testing].
Comment by Ochi (ochi) - Sunday, 09 March 2014, 10:35 GMT
Hm, cuda 6.0.26_rc-1 doesn't seem to make any difference for me. Were you able to reproduce the bug, and did cuda 6 help resolve the issue for you?
Comment by Carlos Silva (r3pek) - Sunday, 09 March 2014, 14:15 GMT
I have the same problem with cudaminer. It doesn't detect any cuda devices.
Comment by arch user (archuser474747) - Monday, 10 March 2014, 04:15 GMT
On the OpenCL front, my GPU isn't being detected either. I use it for BOINC and don't have cuda installed.

nvidia 334.21-2
opencl-nvidia 334.21-3
boinc 7.2.42-1

It was also working fine before 334.21, as per the initial report, and my GPU (GTX 570) is picked up by the *-304xx-* packages.

-edit-
Also, I notice that libcl has been flagged out of date for a few months. Does this have anything to do with it?
https://www.archlinux.org/packages/extra/x86_64/libcl/
Comment by Wellington Melo (wwmm) - Tuesday, 11 March 2014, 11:51 GMT
No, it doesn't. The reason libcl is flagged out of date is NVIDIA's lack of support for OpenCL 1.2.
Comment by Bharath Ghanta (bharath1097) - Thursday, 13 March 2014, 17:46 GMT
It happens to me too with OpenCL.
nvidia 334.21-2
opencl-nvidia 334.21-3
Comment by Doug Newgard (Scimmia) - Saturday, 15 March 2014, 02:22 GMT
Comment by Felix Yan (felixonmars) - Saturday, 15 March 2014, 05:54 GMT
There's also a related post on upstream forum:

https://devtalk.nvidia.com/default/topic/699610/linux/334-21-driver-returns-999-on-cuinit-cuda-/post/4148890/#4148890

I'm not sure if we should do setsid though...
Comment by Felix Yan (felixonmars) - Wednesday, 26 March 2014, 14:18 GMT
Please test with testing/nvidia 334.21-3 or nvidia-lts 334.21-4, thanks!
Comment by Dave Reisner (falconindy) - Wednesday, 26 March 2014, 18:22 GMT
We really shouldn't be distributing nvidia-modprobe. If you need the module explicitly loaded, there's already /etc/modules-load.d for doing so.

Moreover, upstream suggests distro-specific module loading methods over this toy:

https://github.com/NVIDIA/nvidia-modprobe/blob/master/nvidia-modprobe.c#L18
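For reference, the modules-load.d mechanism mentioned above takes a one-line drop-in file; the filename here is illustrative:

```
# /etc/modules-load.d/nvidia-uvm.conf
# systemd-modules-load reads one module name per line at boot
nvidia_uvm
```

Note that, as the later comments show, loading the module alone is not sufficient here, since the /dev/nvidia-uvm node still has to be created.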
Comment by Ochi (ochi) - Wednesday, 26 March 2014, 21:56 GMT
I'll leave it up to you to decide how to fix the problem, but I can say that testing/nvidia 334.21-3 works for me.
Comment by Dave Reisner (falconindy) - Wednesday, 26 March 2014, 22:03 GMT
Does downgrading to a "non-working" package and forcibly loading the module work? There might be a need for a tmpfiles fragment to create the character device...
Comment by Ochi (ochi) - Wednesday, 26 March 2014, 22:36 GMT
If you mean the "nvidia_uvm" module: No, it does not seem to suffice to modprobe that with the 334.21-2 version. I guess the creation of the necessary device node (/dev/nvidia-uvm) is not done by loading that module (the node is magically created, though, as soon as an application using CUDA is run as root).
Comment by Dave Reisner (falconindy) - Wednesday, 26 March 2014, 22:44 GMT
So then we can distribute a tmpfiles.d fragment which contains something like:

c /dev/nvidia-uvm 0644 - - - M:m

where M and m are the major and minor, respectively, of /dev/nvidia-uvm.
Comment by Ochi (ochi) - Wednesday, 26 March 2014, 23:07 GMT
I thought so, too. But I first tried

c /dev/nvidia-uvm 0666 root root - 246:0 (which as far as I can see matches the dev node automatically generated by nvidia)

and then your suggestion

c /dev/nvidia-uvm 0644 - - - 246:0

but it still does not work for me (e.g. the vectorAdd example from the CUDA package) even though the /dev/nvidia-uvm device nodes are generated already at boot. Starting that example application once as root makes it work... for whatever reason. Am I missing something? By the way, the 246:0 are the major/minor numbers that my device nodes generated by nvidia seem to have.
Comment by Dave Reisner (falconindy) - Wednesday, 26 March 2014, 23:20 GMT
Looks like there's some other device nodes created:

https://github.com/NVIDIA/nvidia-modprobe/blob/master/nvidia-modprobe.c#L194

These seem to be character devices with major 195. Can you add:

c /dev/nvidiactl0 0666 - - - 195:0

I'm not 100% sure about the device node name here. If it isn't nvidiactl0, it's nvidia0. I'd also be interested to know what the minimum permissions are for both of these nodes we're adding...
Comment by Dave Reisner (falconindy) - Wednesday, 26 March 2014, 23:21 GMT
If all else fails, you can probably run the vectorAdd example under strace, filtering on the 'open' syscalls, to figure out what the binary is accessing.
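As a sketch of that diagnostic (the vectorAdd path and the sample strace line below are illustrative, not taken from an actual run):

```shell
# Trace open() calls made by the CUDA sample and keep the nvidia-related ones:
#   strace -f -e trace=open ./vectorAdd 2>&1 | grep nvidia
#
# The filtering step works like this on a typical strace output line:
sample='open("/dev/nvidia-uvm", O_RDWR) = -1 ENOENT (No such file or directory)'
printf '%s\n' "$sample" | grep -o '/dev/nvidia[^"]*'
```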
Comment by Felix Yan (felixonmars) - Thursday, 27 March 2014, 01:57 GMT
The nvidia0 and nvidiactl devices are already created with major 195.

The problem is that the major for nvidia-uvm differs from machine to machine (and maybe even from boot to boot if hardware is replaced?). The official CUDA guide [1] uses the following line to get it:

grep nvidia-uvm /proc/devices | awk '{print $1}'

I'm not sure how to implement this correctly in tmpfiles.d way though. Any ideas?

[1] http://developer.download.nvidia.com/compute/cuda/6_0/rc/docs/CUDA_Getting_Started_Linux.pdf
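That extraction can be sketched self-contained; the /proc/devices excerpt below is illustrative sample data, and the actual major number varies per machine:

```shell
# Simulated excerpt of the "Character devices" section of /proc/devices
# (format: "<major> <name>", one entry per line):
sample='195 nvidia
246 nvidia-uvm'

# Same pipeline as in the CUDA guide, run against the sample:
printf '%s\n' "$sample" | grep nvidia-uvm | awk '{print $1}'
```

Because the value is only known at runtime, a static tmpfiles.d fragment cannot hard-code it; something has to read /proc/devices after the module is loaded.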
Comment by Felix Yan (felixonmars) - Thursday, 27 March 2014, 03:22 GMT
I managed to write an ugly udev rule to create the node (I'm completely new to udev), but as this article [1] suggests, this is the wrong way to go (but what is the correct way, then?).

Quoting:
> Writing rules is not a workaround for the problem where no device nodes for your particular device exist

Anyway, the rule & script I wrote (works for me):

/etc/udev/rules.d/60-nvidia-uvm.rules
KERNEL=="nvidia_uvm", RUN+="/usr/local/bin/nvidia-uvm-probe"

/usr/local/bin/nvidia-uvm-probe
#!/bin/sh
MAJOR=$(grep nvidia-uvm /proc/devices | awk '{print $1}')
/usr/bin/mknod -m 660 /dev/nvidia-uvm c $MAJOR 0
/usr/bin/chgrp video /dev/nvidia-uvm

[1] http://www.reactivated.net/writing_udev_rules.html

================= UPDATE =================
Finally worked out a one-liner udev rule (still not the right way, though):

/etc/udev/rules.d/60-nvidia-uvm.rules
KERNEL=="nvidia_uvm", RUN+="/usr/bin/bash -c '/usr/bin/mknod -m 660 /dev/nvidia-uvm c $(grep nvidia-uvm /proc/devices | cut -d \ -f 1) 0; /usr/bin/chgrp video /dev/nvidia-uvm'"
(There are two spaces after cut -d \, but flyspray ate one when displaying :/)
Comment by Carlos Silva (r3pek) - Thursday, 27 March 2014, 03:27 GMT
Since the device doesn't have an index, I suppose SLI/multi-card configurations only have one such device too, right?
Comment by Felix Yan (felixonmars) - Thursday, 27 March 2014, 16:13 GMT
I just moved the binary from the nvidia/nvidia-lts packages to nvidia-utils. Please downgrade nvidia/nvidia-lts to the [extra] version if applicable, and upgrade to testing/nvidia-utils 334.21-7. Please do let me know if everything related to cuda/opencl is back to normal.