FS#62282 - [cuda] $PATH and $LD_LIBRARY_PATH were not updated automatically

Attached to Project: Community Packages
Opened by Zhen Xi (Mayrixon) - Tuesday, 09 April 2019, 00:12 GMT
Last edited by Konstantin Gizdov (kgizdov) - Thursday, 11 April 2019, 16:50 GMT
Task Type Bug Report
Category Packages
Status Closed
Assigned To Sven-Hendrik Haase (Svenstaro)
Konstantin Gizdov (kgizdov)
Architecture x86_64
Severity Medium
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 1
Private No

Details

Description:
Packages that require cuda, such as python-tensorflow-cuda, cannot find the cuda *.so libraries automatically.

Additional info:
* package version(s)
cuda 10.1.105-6

* config and/or log files etc.
* link to upstream bug report, if any

Steps to reproduce:
This task depends upon

Closed by  Konstantin Gizdov (kgizdov)
Thursday, 11 April 2019, 16:50 GMT
Reason for closing:  Fixed
Additional comments about closing:  glibc-2.28-6
cuda-10.1.105-8
Comment by Zhen Xi (Mayrixon) - Tuesday, 09 April 2019, 00:15 GMT
Sorry for the misclick.

Package versions:
cuda 10.1.105-6
python-tensorflow-cuda 1.13.1-4

Logs:
2019-04-09 01:14:15.966295: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-04-09 01:14:15.983045: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 4009500000 Hz
2019-04-09 01:14:15.983530: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x564380471ef0 executing computations on platform Host. Devices:
2019-04-09 01:14:15.983542: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2019-04-09 01:14:16.038294: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-09 01:14:16.038739: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x56437f2f9fd0 executing computations on platform CUDA. Devices:
2019-04-09 01:14:16.038753: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1
2019-04-09 01:14:16.038956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.86
pciBusID: 0000:01:00.0
totalMemory: 7.93GiB freeMemory: 7.39GiB
2019-04-09 01:14:16.038965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-04-09 01:14:16.282489: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-09 01:14:16.282508: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-04-09 01:14:16.282512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-04-09 01:14:16.282705: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7120 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019-04-09 01:14:17.015797: I tensorflow/stream_executor/dso_loader.cc:142] Couldn't open CUDA library libcublas.so.10.1. LD_LIBRARY_PATH:
2019-04-09 01:14:17.015819: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Failed precondition: could not dlopen DSO: libcublas.so.10.1; dlerror: libcublas.so.10.1: cannot open shared object file: No such file or directory

The bug can be worked around by adding the following commands to .bashrc or .zshrc:
export PATH=/opt/cuda/bin:$PATH
export LD_LIBRARY_PATH=/opt/cuda/targets/x86_64-linux/lib:$LD_LIBRARY_PATH
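A quick way to check whether the workaround is needed (or took effect) is to query the dynamic linker cache directly; the /opt/cuda path below assumes the Arch cuda package layout:

```shell
# Does the file exist where the Arch cuda package installs it?
ls /opt/cuda/targets/x86_64-linux/lib/libcublas.so.10.1 2>/dev/null \
    || echo "file not present at the expected path"
# Does the linker cache know the SONAME TensorFlow dlopens?
ldconfig -p | grep libcublas.so.10.1 \
    || echo "SONAME not in the linker cache; the LD_LIBRARY_PATH workaround applies"
```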
Comment by Konstantin Gizdov (kgizdov) - Wednesday, 10 April 2019, 11:13 GMT
python-tensorflow does not depend on or use cuda. Only python-tensorflow-cuda or python-tensorflow-opt-cuda try to look for cuda, so I'm very sceptical this log was produced by python-tensorflow.

Please describe your setup in detail (CPU, GPU, relevant installed packages) and provide full steps to reproduce the issue.
Comment by Zhen Xi (Mayrixon) - Wednesday, 10 April 2019, 13:42 GMT
That is my fault. I am using python-tensorflow-cuda rather than python-tensorflow. The log was produced by python-tensorflow-cuda.
Comment by Konstantin Gizdov (kgizdov) - Wednesday, 10 April 2019, 18:31 GMT
Could you please provide the details requested? It would also be nice to know if you have a hook to recompile nvidia modules on upgrade. Thanks
Comment by Zhen Xi (Mayrixon) - Wednesday, 10 April 2019, 19:43 GMT
hardware:
CPU: i7-6700k
GPU: GTX 1080

relevant installed packages:
linux 5.0.7.arch1-1
nvidia 418.56-7
nvidia-utils 418.56-1
cuda 10.1.105-6
cudnn 7.5.0.56-1
python 3.7.3-1
python-tensorflow-cuda 1.13.1-4

relevant settings:
/etc/mkinitcpio.conf
MODULES=(nvidia nvidia_modeset nvidia_uvm nvidia_drm)
HOOKS=(base udev autodetect modconf block filesystems keyboard fsck)

/etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet nvidia-drm.modeset=1 nowatchdog"

/etc/pacman.d/hooks
[Trigger]
Operation=Install
Operation=Upgrade
Operation=Remove
Type=Package
Target=nvidia
Target=linux
# Change the linux part above and in the Exec line if a different kernel is used

[Action]
Description=Update Nvidia module in initcpio
Depends=mkinitcpio
When=PostTransaction
NeedsTargets
Exec=/bin/sh -c 'while read -r trg; do case $trg in linux) exit 0; esac; done; /usr/bin/mkinitcpio -P'

Shell variables:
$PATH
/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/opt/cuda/bin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/home/zhen/.antigen/bundles/robbyrussell/oh-my-zsh/lib:/home/zhen/.antigen/bundles/robbyrussell/oh-my-zsh/plugins/command-not-found:/home/zhen/.antigen/bundles/robbyrussell/oh-my-zsh/plugins/fzf:/home/zhen/.antigen/bundles/robbyrussell/oh-my-zsh/plugins/tmux:/home/zhen/.antigen/bundles/robbyrussell/oh-my-zsh/plugins/git:/home/zhen/.antigen/bundles/robbyrussell/oh-my-zsh/plugins/gitignore:/home/zhen/.antigen/bundles/robbyrussell/oh-my-zsh/plugins/pip:/home/zhen/.antigen/bundles/robbyrussell/oh-my-zsh/plugins/colorize:/home/zhen/.antigen/bundles/robbyrussell/oh-my-zsh/plugins/history:/home/zhen/.antigen/bundles/robbyrussell/oh-my-zsh/plugins/thefuck:/home/zhen/.antigen/bundles/Vifon/deer:/home/zhen/.antigen/bundles/supercrabtree/k:/home/zhen/.antigen/bundles/zsh-users/zsh-autosuggestions:/home/zhen/.antigen/bundles/zsh-users/zsh-completions
$LD_LIBRARY_PATH (this variable is not set)


Steps to reproduce:
1. Create a python script minimum_script.py as follows:
import tensorflow as tf

mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)
2. Execute the command python minimum_script.py
3. Logs as follows:
WARNING:tensorflow:From /usr/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py:435: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-04-10 20:42:50.494741: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-04-10 20:42:50.515438: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 4009500000 Hz
2019-04-10 20:42:50.515766: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55879a667250 executing computations on platform Host. Devices:
2019-04-10 20:42:50.515781: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2019-04-10 20:42:50.585591: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-10 20:42:50.586063: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x558799e29500 executing computations on platform CUDA. Devices:
2019-04-10 20:42:50.586075: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1
2019-04-10 20:42:50.586274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.86
pciBusID: 0000:01:00.0
totalMemory: 7.93GiB freeMemory: 6.12GiB
2019-04-10 20:42:50.586283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-04-10 20:42:50.842967: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-10 20:42:50.842988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-04-10 20:42:50.842995: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-04-10 20:42:50.843184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5892 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
Epoch 1/5
2019-04-10 20:42:50.989782: I tensorflow/stream_executor/dso_loader.cc:142] Couldn't open CUDA library libcublas.so.10.1. LD_LIBRARY_PATH:
2019-04-10 20:42:50.989802: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Failed precondition: could not dlopen DSO: libcublas.so.10.1; dlerror: libcublas.so.10.1: cannot open shared object file: No such file or directory
[1] 28870 abort (core dumped) python minimum_script.py
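The failure can be reproduced without the full Keras script by asking the dynamic loader for the same SONAME directly (python3 is assumed to be the interpreter used above):

```shell
# Try to dlopen the exact SONAME TensorFlow's stream_executor requests;
# on an affected system this fails just like the abort in the log above.
python3 - <<'EOF'
import ctypes
try:
    ctypes.CDLL("libcublas.so.10.1")
    print("libcublas.so.10.1: loaded OK")
except OSError as err:
    print("dlopen failed:", err)
EOF
```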

Comment by Konstantin Gizdov (kgizdov) - Thursday, 11 April 2019, 10:05 GMT
I am able to replicate. This turns out to be a regression in the libraries shipped with CUDA 10.1: not all of them carry a SONAME with a 10.1 variant, so names like libcublas.so.10.1 cannot be resolved. There is now a version of cuda in testing that attempts to correct for this regression, but it requires a new release of glibc (ldconfig). Thus we cannot do much in the immediate term. My recommendation is to downgrade to cuda 10 and an older tensorflow that was built against it (1.13.1-3, for example).
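The SONAME mismatch described above can be inspected with readelf; libc.so.6 is used below only as a universally present stand-in, whereas on an affected system one would point readelf at the libraries under /opt/cuda:

```shell
# Print the SONAME embedded in a shared object. ldconfig creates the
# runtime symlinks from these SONAMEs, so a library shipped without a
# .so.10.1 SONAME never becomes resolvable as libcublas.so.10.1.
libc_path=$(ldconfig -p | awk '/libc\.so\.6 / {print $NF; exit}')
readelf -d "$libc_path" | grep SONAME
```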
Comment by Konstantin Gizdov (kgizdov) - Thursday, 11 April 2019, 12:51 GMT
Progress update: we now have glibc-2.28-6 in [testing] and cuda-10.1.105-8 in [community-testing], which should resolve the issue. Please try them and see if it works.
Comment by Zhen Xi (Mayrixon) - Thursday, 11 April 2019, 14:37 GMT
Updated packages:
glibc-2.28-6
cuda-10.1.105-8

The problem has been solved. Thank you!
