FS#75532 - [python-tensorflow-opt-cuda] Package not built with compute capability 5.0 and higher

Attached to Project: Community Packages
Opened by WhoseTheNerd (WhoseTheNerd) - Sunday, 07 August 2022, 12:02 GMT
Last edited by Sven-Hendrik Haase (Svenstaro) - Monday, 08 August 2022, 17:11 GMT
Task Type Bug Report
Category Packages
Status Closed
Assigned To Sven-Hendrik Haase (Svenstaro)
Architecture All
Severity Low
Priority Normal
Reported Version
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 1
Private No

Details

Description: Running any TensorFlow application on an Nvidia graphics card with CUDA compute capability 5.0 fails with " ./tensorflow/core/kernels/random_op_gpu.h:244] Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch<Distribution>, num_blocks, block_size, 0, d.stream(), key, counter, gen, data, size, dist) status: INTERNAL: no kernel image is available for execution on the device". Earlier in the log, TensorFlow warns that it wasn't built for compute capability 5.0: "W tensorflow/core/common_runtime/gpu/gpu_device.cc:1943] TensorFlow was not built with CUDA kernel binaries compatible with compute capability 5.0. CUDA kernels will be jit-compiled from PTX, which could take 30 minutes or longer." However, JIT compilation never happens: the CPU usage graphs show no high utilization, and usage stays low at about 1%.


Additional info:
* package version(s) 2.9.1-2
* config and/or log files etc.
* link to upstream bug report, if any

Steps to reproduce:
Run any TensorFlow application that requires a GPU, such as training a neural network, on a card with compute capability 5.0. I'm using a GTX 750 Ti.

Closed by  Sven-Hendrik Haase (Svenstaro)
Monday, 08 August 2022, 17:11 GMT
Reason for closing:  Won't fix
Additional comments about closing:  We won't support officially deprecated architectures. Users should compile their own versions of tensorflow if that is required.
Comment by Sven-Hendrik Haase (Svenstaro) - Sunday, 07 August 2022, 15:02 GMT
Yeah, that seems about right. We build it like this: export TF_CUDA_COMPUTE_CAPABILITIES=sm_52,sm_53,sm_60,sm_61,sm_62,sm_70,sm_72,sm_75,sm_80,sm_86,compute_86

Your GPU seems older than that: https://developer.nvidia.com/cuda-gpus#compute

The set we build with is the current non-deprecated set for CUDA. Building for older architectures might be possible, but they are deprecated by NVIDIA, so we're keeping to the officially supported architectures. I think your best bet is to compile it yourself and hope it still works if you need to run on that old GPU for some reason.
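
To illustrate why a compute capability 5.0 card finds no usable kernel in this build, here is a minimal Python sketch that checks a device against the build list quoted above. It assumes the usual CUDA compatibility rules in simplified form: an sm_XY binary runs on devices with the same major version and an equal or higher minor version, and compute_XY PTX can be JIT-compiled for any device at or above X.Y. The function names parse_capabilities and coverage are hypothetical helpers for this illustration, not part of TensorFlow or the PKGBUILD.

```python
# Build list quoted in the comment above.
ARCH_BUILD_LIST = "sm_52,sm_53,sm_60,sm_61,sm_62,sm_70,sm_72,sm_75,sm_80,sm_86,compute_86"


def parse_capabilities(spec):
    """Turn 'sm_52,compute_86' into [('sm', 5, 2), ('compute', 8, 6)]."""
    entries = []
    for item in spec.split(","):
        kind, _, digits = item.strip().partition("_")
        # Assumes purely numeric versions such as '52'; suffixed forms are not handled.
        entries.append((kind, int(digits[:-1]), int(digits[-1])))
    return entries


def coverage(spec, device):
    """Return 'binary', 'ptx', or None for a device given as (major, minor)."""
    dev_major, dev_minor = device
    entries = parse_capabilities(spec)
    # Prebuilt binaries: same major version, minor not newer than the device.
    for kind, major, minor in entries:
        if kind == "sm" and major == dev_major and minor <= dev_minor:
            return "binary"
    # PTX fallback: can only be JIT-compiled for devices at or above its version.
    for kind, major, minor in entries:
        if kind == "compute" and (major, minor) <= (dev_major, dev_minor):
            return "ptx"
    return None


if __name__ == "__main__":
    for device in [(5, 0), (5, 2), (8, 6), (9, 0)]:
        print(device, coverage(ARCH_BUILD_LIST, device))
```

Under these assumptions a 5.0 device gets neither a binary nor usable PTX, because the only PTX in the list targets compute_86. That would be consistent with the reported "no kernel image is available" error and with the JIT compilation never starting.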
Comment by WhoseTheNerd (WhoseTheNerd) - Sunday, 07 August 2022, 15:08 GMT
I have tried to build TensorFlow from source, but ran into build failures, 404s, etc. I will look into building from scratch, but the AUR packages are out of date, and the build instructions are very short and don't explain the failures that might happen.
Comment by WhoseTheNerd (WhoseTheNerd) - Sunday, 07 August 2022, 16:07 GMT
I did some more digging and found that the pip package tensorflow-gpu supports compute capability 5.0. I think CUDA compute capability support should match upstream TensorFlow, not Nvidia's actively maintained list.
Comment by Sven-Hendrik Haase (Svenstaro) - Sunday, 07 August 2022, 16:18 GMT
Why try the AUR packages OR building straight upstream? Did you run into trouble building the official Arch package with your architecture added? Should be quite easy to do.
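
As a rough sketch of what "your architecture added" could look like, the Python helper below rewrites the export line quoted earlier in this thread so that it also contains sm_50. The function name add_capability is hypothetical, and whether the resulting list actually builds with the CUDA toolkit in use is an assumption that has not been verified here.

```python
def add_capability(export_line, new_entry):
    """Return the export line with new_entry added to the capability list."""
    prefix, _, value = export_line.partition("=")
    entries = [entry for entry in value.split(",") if entry]
    if new_entry in entries:
        return export_line  # already present, nothing to change
    entries.append(new_entry)
    # Keep sm_* entries first, then compute_* entries, each in numeric order.
    entries.sort(key=lambda e: (e.startswith("compute_"), int(e.split("_")[1])))
    return prefix + "=" + ",".join(entries)


if __name__ == "__main__":
    line = ("export TF_CUDA_COMPUTE_CAPABILITIES="
            "sm_52,sm_53,sm_60,sm_61,sm_62,sm_70,sm_72,sm_75,sm_80,sm_86,compute_86")
    print(add_capability(line, "sm_50"))
```

The same change can of course be made by hand in the PKGBUILD; the helper only makes the intended edit explicit.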
Comment by WhoseTheNerd (WhoseTheNerd) - Sunday, 07 August 2022, 16:56 GMT
"Should be quite easy to do."

Wish that was the case...

[whosethenerd@whosethenerd-pc ~/build/svntogit-community/tensorflow/trunk]$ nano PKGBUILD
[whosethenerd@whosethenerd-pc ~/build/svntogit-community/tensorflow/trunk]$ makepkg -si
==> Making package: tensorflow 2.9.1-2 (Sun 07 Aug 2022 19:54:13 EEST)
==> Checking runtime dependencies...
==> Checking buildtime dependencies...
==> Retrieving sources...
-> Downloading tensorflow-2.9.1.tar.gz...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 63.5M 0 63.5M 0 0 5805k 0 --:--:-- 0:00:11 --:--:-- 7335k
-> Found fix-c++17-compat.patch
==> Validating source files with sha512sums...
tensorflow-2.9.1.tar.gz ... Passed
fix-c++17-compat.patch ... Passed
==> Extracting sources...
-> Extracting tensorflow-2.9.1.tar.gz with bsdtar
==> Starting prepare()...
==> Starting build()...
Building without cuda and without non-x86-64 optimizations
You have bazel 5.2.0 installed.
Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details.
--config=mkl # Build with MKL support.
--config=mkl_aarch64 # Build with oneDNN and Compute Library for the Arm Architecture (ACL).
--config=monolithic # Config for mostly static monolithic build.
--config=numa # Build with NUMA support.
--config=dynamic_kernels # (Experimental) Build kernels into separate shared objects.
--config=v1 # Build with TensorFlow 1 API instead of TF 2 API.
Preconfigured Bazel build configs to DISABLE default on features:
--config=nogcp # Disable GCP support.
--config=nonccl # Disable NVIDIA NCCL support.
Configuration finished
Starting local Bazel server and connecting to it...
INFO: Options provided by the client:
Inherited 'common' options: --isatty=1 --terminal_columns=236
INFO: Reading rc options for 'build' from /home/whosethenerd/build/svntogit-community/tensorflow/trunk/src/tensorflow-2.9.1/.bazelrc:
Inherited 'common' options: --experimental_repo_remote_exec
INFO: Reading rc options for 'build' from /home/whosethenerd/build/svntogit-community/tensorflow/trunk/src/tensorflow-2.9.1/.bazelrc:
'build' options: --define framework_shared_object=true --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true --spawn_strategy=standalone -c opt --announce_rc --define=grpc_no_ares=true --noincompatible_remove_legacy_whole_archive --enable_platform_specific_config --define=with_xla_support=true --config=short_logs --config=v2 --define=no_aws_support=true --define=no_hdfs_support=true --experimental_cc_shared_library
INFO: Reading rc options for 'build' from /home/whosethenerd/build/svntogit-community/tensorflow/trunk/src/tensorflow-2.9.1/.tf_configure.bazelrc:
'build' options: --action_env PYTHON_BIN_PATH=/usr/bin/python --action_env PYTHON_LIB_PATH=/usr/lib/python3.10/site-packages --python_path=/usr/bin/python --define=with_xla_support=true --action_env TF_SYSTEM_LIBS=boringssl,curl,cython,gif,icu,libjpeg_turbo,lmdb,nasm,png,pybind11,zlib
INFO: Reading rc options for 'build' from /home/whosethenerd/build/svntogit-community/tensorflow/trunk/src/tensorflow-2.9.1/.bazelrc:
'build' options: --deleted_packages=tensorflow/compiler/mlir/tfrt,tensorflow/compiler/mlir/tfrt/benchmarks,tensorflow/compiler/mlir/tfrt/jit/python_binding,tensorflow/compiler/mlir/tfrt/jit/transforms,tensorflow/compiler/mlir/tfrt/python_tests,tensorflow/compiler/mlir/tfrt/tests,tensorflow/compiler/mlir/tfrt/tests/ir,tensorflow/compiler/mlir/tfrt/tests/analysis,tensorflow/compiler/mlir/tfrt/tests/jit,tensorflow/compiler/mlir/tfrt/tests/lhlo_to_tfrt,tensorflow/compiler/mlir/tfrt/tests/tf_to_corert,tensorflow/compiler/mlir/tfrt/tests/tf_to_tfrt_data,tensorflow/compiler/mlir/tfrt/tests/saved_model,tensorflow/compiler/mlir/tfrt/transforms/lhlo_gpu_to_tfrt_gpu,tensorflow/core/runtime_fallback,tensorflow/core/runtime_fallback/conversion,tensorflow/core/runtime_fallback/kernel,tensorflow/core/runtime_fallback/opdefs,tensorflow/core/runtime_fallback/runtime,tensorflow/core/runtime_fallback/util,tensorflow/core/tfrt/common,tensorflow/core/tfrt/eager,tensorflow/core/tfrt/eager/backends/cpu,tensorflow/core/tfrt/eager/backends/gpu,tensorflow/core/tfrt/eager/core_runtime,tensorflow/core/tfrt/eager/cpp_tests/core_runtime,tensorflow/core/tfrt/gpu,tensorflow/core/tfrt/run_handler_thread_pool,tensorflow/core/tfrt/runtime,tensorflow/core/tfrt/saved_model,tensorflow/core/tfrt/graph_executor,tensorflow/core/tfrt/saved_model/tests,tensorflow/core/tfrt/tpu,tensorflow/core/tfrt/utils
INFO: Found applicable config definition build:short_logs in file /home/whosethenerd/build/svntogit-community/tensorflow/trunk/src/tensorflow-2.9.1/.bazelrc: --output_filter=DONT_MATCH_ANYTHING
INFO: Found applicable config definition build:v2 in file /home/whosethenerd/build/svntogit-community/tensorflow/trunk/src/tensorflow-2.9.1/.bazelrc: --define=tf_api_version=2 --action_env=TF2_BEHAVIOR=1
INFO: Found applicable config definition build:mkl in file /home/whosethenerd/build/svntogit-community/tensorflow/trunk/src/tensorflow-2.9.1/.bazelrc: --define=build_with_mkl=true --define=enable_mkl=true --define=tensorflow_mkldnn_contraction_kernel=0 --define=build_with_openmp=true -c opt
INFO: Found applicable config definition build:linux in file /home/whosethenerd/build/svntogit-community/tensorflow/trunk/src/tensorflow-2.9.1/.bazelrc: --copt=-w --host_copt=-w --define=PREFIX=/usr --define=LIBDIR=$(PREFIX)/lib --define=INCLUDEDIR=$(PREFIX)/include --define=PROTOBUF_INCLUDE_PATH=$(PREFIX)/include --cxxopt=-std=c++14 --host_cxxopt=-std=c++14 --config=dynamic_kernels --distinct_host_configuration=false --experimental_guard_against_concurrent_changes
INFO: Found applicable config definition build:dynamic_kernels in file /home/whosethenerd/build/svntogit-community/tensorflow/trunk/src/tensorflow-2.9.1/.bazelrc: --define=dynamic_loaded_kernels=true --copt=-DAUTOLOAD_DYNAMIC_KERNELS
DEBUG: Rule 'io_bazel_rules_docker' indicated that a canonical reproducible form can be obtained by modifying arguments shallow_since = "1596824487 -0400"
DEBUG: Repository io_bazel_rules_docker instantiated at:
/home/whosethenerd/build/svntogit-community/tensorflow/trunk/src/tensorflow-2.9.1/WORKSPACE:23:14: in <toplevel>
/home/whosethenerd/build/svntogit-community/tensorflow/trunk/src/tensorflow-2.9.1/tensorflow/workspace0.bzl:107:34: in workspace
/home/whosethenerd/.cache/bazel/_bazel_whosethenerd/cb8aecb20a72b139c69318309100993b/external/bazel_toolchains/repositories/repositories.bzl:35:23: in repositories
Repository rule git_repository defined at:
/home/whosethenerd/.cache/bazel/_bazel_whosethenerd/cb8aecb20a72b139c69318309100993b/external/bazel_tools/tools/build_defs/repo/git.bzl:199:33: in <toplevel>
INFO: Analyzed 4 targets (486 packages loaded, 27036 targets configured).
INFO: Found 4 targets...
[0 / 25] [Prepa] Writing file tensorflow/libtensorflow_cc.so.2.9.1-2.params
FATAL: bazel crashed due to an internal error. Printing stack trace:
java.lang.ExceptionInInitializerError
at com.google.devtools.build.lib.actions.ParameterFile.writeContent(ParameterFile.java:118)
at com.google.devtools.build.lib.actions.ParameterFile.writeParameterFile(ParameterFile.java:111)
at com.google.devtools.build.lib.analysis.actions.ParameterFileWriteAction$ParamFileWriter.writeOutputFile(ParameterFileWriteAction.java:170)
at com.google.devtools.build.lib.exec.FileWriteStrategy.beginWriteOutputToFile(FileWriteStrategy.java:58)
at com.google.devtools.build.lib.analysis.actions.FileWriteActionContext.beginWriteOutputToFile(FileWriteActionContext.java:49)
at com.google.devtools.build.lib.analysis.actions.AbstractFileWriteAction.beginExecution(AbstractFileWriteAction.java:66)
at com.google.devtools.build.lib.actions.Action.execute(Action.java:133)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$5.execute(SkyframeActionExecutor.java:907)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:1076)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1031)
at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:152)
at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:91)
at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:492)
at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:856)
at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.computeInternal(ActionExecutionFunction.java:349)
at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:169)
at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:590)
at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:382)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make java.lang.String(byte[],byte) accessible: module java.base does not "opens java.lang" to unnamed module @3daa422a
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:354)
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:297)
at java.base/java.lang.reflect.Constructor.checkCanSetAccessible(Constructor.java:191)
at java.base/java.lang.reflect.Constructor.setAccessible(Constructor.java:184)
at com.google.devtools.build.lib.unsafe.StringUnsafe.<init>(StringUnsafe.java:75)
at com.google.devtools.build.lib.unsafe.StringUnsafe.initInstance(StringUnsafe.java:56)
at com.google.devtools.build.lib.unsafe.StringUnsafe.<clinit>(StringUnsafe.java:37)
... 21 more
==> ERROR: A failure occurred in build().
Aborting...
[whosethenerd@whosethenerd-pc ~/build/svntogit-community/tensorflow/trunk]$ pacman -Qs openjdk
local/jdk-openjdk 18.0.2.u9-1
OpenJDK Java 18 development kit
local/jdk11-openjdk 11.0.16.u8-2
OpenJDK Java 11 development kit
local/jre-openjdk 18.0.2.u9-1
OpenJDK Java 18 full runtime environment
local/jre-openjdk-headless 18.0.2.u9-1
OpenJDK Java 18 headless runtime environment
local/jre11-openjdk 11.0.16.u8-2
OpenJDK Java 11 full runtime environment
local/jre11-openjdk-headless 11.0.16.u8-2
OpenJDK Java 11 headless runtime environment
[whosethenerd@whosethenerd-pc ~/build/svntogit-community/tensorflow/trunk]$
Comment by Sven-Hendrik Haase (Svenstaro) - Sunday, 07 August 2022, 17:25 GMT
You should use extra-x86_64-build (instead of makepkg) to ensure that you get a package built in a clean environment. That should make it work. Word of warning though: I'm building this on 64 threads with 128GiB of RAM and it takes 3h to build.
