Getting both OpenCL and CUDA working with OpenMM 6.1

Ondrej Marsalek · Post by **Ondrej Marsalek** » Wed Oct 22, 2014 7:41 pm

I am trying to install OpenMM 6.1 on Ubuntu 12.04 with CUDA 5.5 from the repository. When I install the pre-built binary, the CUDA platform is not found when I run the installation test script. With otherwise identical settings, it is found when I build from source and install.

However, when I build from source, I get OpenCL errors during linking, even though the OpenCL shared library is found - OPENCL_LIBRARY=/usr/lib/x86_64-linux-gnu/libOpenCL.so. Specifically, I get (with VERBOSE=1):

Code:

Linking CXX executable ../../../TestOpenCLAndersenThermostat
cd /home/marsalek/build/OpenMM6.1-Build/platforms/opencl/tests && /usr/bin/cmake -E cmake_link_script CMakeFiles/TestOpenCLAndersenThermostat.dir/link.txt --verbose=1
/usr/bin/clang++   -O3 -DNDEBUG    -msse2 CMakeFiles/TestOpenCLAndersenThermostat.dir/TestOpenCLAndersenThermostat.cpp.o  -o ../../../TestOpenCLAndersenThermostat -rdynamic ../../../libOpenMMOpenCL.so ../../../libOpenMM.so -ldl -lOpenCL -lpthread -Wl,-rpath,/home/marsalek/build/OpenMM6.1-Build 
../../../libOpenMMOpenCL.so: undefined reference to `clRetainDevice'
../../../libOpenMMOpenCL.so: undefined reference to `clReleaseDevice'
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [TestOpenCLAndersenThermostat] Error 1
make[2]: Leaving directory `/home/marsalek/build/OpenMM6.1-Build'
make[1]: *** [platforms/opencl/tests/CMakeFiles/TestOpenCLAndersenThermostat.dir/all] Error 2
make[1]: Leaving directory `/home/marsalek/build/OpenMM6.1-Build'
make: *** [all] Error 2

Using gcc instead of clang results in the same error. Thank you for any suggestions.

Peter Eastman · Post by **Peter Eastman** » Thu Oct 23, 2014 10:50 am

Hi Ondrej,

The precompiled binaries were built against CUDA 6.0. CUDA releases are not binary compatible with each other, so OpenMM will only work with the version it was built against. That's why the precompiled one isn't working for you, but it does work when you compile it yourself against CUDA 5.5.

Peter
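One way to see which CUDA release a machine can actually satisfy is to ask the dynamic loader whether the versioned libraries can be opened. The sketch below is only an illustration. It assumes the prebuilt plugin depends on a versioned CUDA library such as cuFFT, and the names `libcufft.so.5.5` and `libcufft.so.6.0` are examples chosen here, not something stated in the thread.

```python
import ctypes


def loadable(sonames):
    """Map each shared-library name to True if the dynamic loader can open it."""
    result = {}
    for name in sonames:
        try:
            ctypes.CDLL(name)
            result[name] = True
        except OSError:
            # Raised when the loader cannot find or open the library.
            result[name] = False
    return result


if __name__ == "__main__":
    # Illustrative names only (assumption): cuFFT as shipped with CUDA 5.5 and 6.0.
    for name, ok in loadable(["libcufft.so.5.5", "libcufft.so.6.0"]).items():
        print(name, "found" if ok else "not found")
```

If only the 5.5 library is reported as found, a binary built against 6.0 would not be expected to load, which matches the explanation above.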

Ondrej Marsalek · Post by **Ondrej Marsalek** » Thu Oct 23, 2014 10:55 am

Ah, sorry about that, must have missed the version bump. I remembered that OpenMM 6.0 was compiled against CUDA 5.5.

Lee-Ping Wang · Post by **Lee-Ping Wang** » Thu Oct 23, 2014 11:04 am

Hi Ondrej,

The "undefined reference" error means that the linker could not find a symbol (here, a function name) in any of the libraries it was asked to link against. The symbol has to be defined in one of those libraries for the link command to succeed.

"clRetainDevice" looks like part of the OpenCL library, so I would run "objdump -t" or "readelf -s" on your libOpenCL.so file and grep for "clRetainDevice" to see whether the symbol exists.
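The same question can also be put to the dynamic loader directly, which sidesteps the choice of tool and flags. Installed shared libraries are often stripped and keep only their dynamic symbol table, which is what `nm -D`, `objdump -T` and `readelf --dyn-syms` print. A minimal sketch using Python's `ctypes` follows; the function name `missing_symbols` is made up for this example.

```python
import ctypes


def missing_symbols(library, names):
    """Return the names that the dynamic loader cannot resolve in `library`.

    `library` is a path or soname; passing None inspects the running process.
    """
    lib = ctypes.CDLL(library)
    # Attribute lookup on a CDLL object resolves the symbol and raises
    # AttributeError when it is not found, so hasattr() works as a probe.
    return [name for name in names if not hasattr(lib, name)]
```

For example, `missing_symbols("/usr/lib/x86_64-linux-gnu/libOpenCL.so", ["clRetainDevice", "clReleaseDevice"])` should return an empty list if the library defines both functions, and list whichever ones it lacks otherwise.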

Sometimes the symbol names are mangled in the library, which means the linker cannot find them. As a self-taught "programmer", I have no idea why they do this.

Peter, please feel free to correct anything I said above that might be wrong.

Thanks,

- Lee-Ping

BTW: Ondrej is a postdoc in the Markland group just across the street.

Jason Swails · Post by **Jason Swails** » Thu Oct 23, 2014 11:23 am

Lee-Ping Wang wrote: Sometimes the symbol names are mangled in the library which means the compiler cannot find them. As a self taught "programmer", I have no idea why they do this.

I'm also a "self-taught programmer" (but hey! I took a Java class in high school... does that count?). However, name mangling is not done willy-nilly -- only when there is a good reason to do so. Specifically, names are mangled to prevent them from being used when they shouldn't.

For example, most Fortran compilers give every subroutine a trailing _ (and make all subroutine calls look for symbols with a trailing _), which is a widely shared form of "name mangling". This is used to help differentiate between C and Fortran routines to accommodate cross-linking the two languages. Symbols defined inside Fortran modules are also name-mangled. This prevents you from calling subroutines defined inside a module without explicitly "using" that module (and also prohibits you from using private module members). Anything inside a C++ namespace is also name-mangled (including class methods and such) -- the name mangling in C++ can get complicated due to inheritance. Same with Python (which name-mangles class attributes whose names start with two underscores, to avoid clashes between a class and its subclasses, IIRC).

You can see where this is going... names are mangled when those symbols are defined in a specific namespace (usually in a compiler-dependent fashion, except for the trailing _ in Fortran subroutines). This is done because those symbols are not meant to be accessible in the global namespace. Of course, if you know how the names are mangled, you can (at least sometimes) call those functions from the global namespace, but that will not necessarily work on all compilers.

Anything that is defined in the so-called "global" namespace should not be mangled. I would expect that to apply to these symbols in the OpenCL library (i.e., they should not be mangled).
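Python's variant of this is easy to try out. Inside a class body, an attribute whose name starts with two underscores (and does not end with two) is stored under a name prefixed with the class name. The class `Account` below is a made-up example, not something from OpenMM or OpenCL.

```python
class Account:
    def __init__(self):
        # Inside the class body this is rewritten to self._Account__balance.
        self.__balance = 100

    def balance(self):
        # Also rewritten, so it finds the attribute set in __init__.
        return self.__balance


acct = Account()
print(acct.balance())              # 100: access from inside the class works
print(hasattr(acct, "__balance"))  # False: the plain name does not exist
print(acct._Account__balance)      # 100: reachable if you know the scheme
```

The last line is the Python counterpart of calling a mangled symbol from the global namespace: it works, but only because the caller knows how the name was rewritten.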

When I googled the error message, I got sent to https://bugs.launchpad.net/enblend/+bug/1305794 -- suggesting that the OpenCL installed on the system is broken (or too old). I would also suggest making sure you are using the vendor-supplied OpenCL (from your GPU vendor).

Peter Eastman · Post by **Peter Eastman** » Thu Oct 23, 2014 11:40 am

I don't think name mangling is the problem. That's a standard C function.

Where did that OpenCL library come from? I don't have any such library in that location on my systems. Instead, I'm compiling against the one in /usr/lib, which is where the NVIDIA driver installer puts it.

Perhaps that came from Mesa? I know they're working on OpenCL support, but it's still very incomplete, so it's not surprising if compiling against their library doesn't work.

Peter

Lee-Ping Wang · Post by **Lee-Ping Wang** » Thu Oct 23, 2014 12:00 pm

Hi Jason,

Thanks for the explanation.

I believe Ondrej was using the NVIDIA drivers and corresponding OpenCL from the Ubuntu repository. It is probably a version problem then - I would recommend installing the latest NVIDIA driver and corresponding OpenCL from the NVIDIA website.

As an aside, installing NVIDIA drivers outside the package manager can sometimes be a pain.

Thanks,

- Lee-Ping

Ondrej Marsalek · Post by **Ondrej Marsalek** » Thu Oct 23, 2014 12:07 pm

Hi Lee-Ping, I should have clarified that the type of issue and the concepts involved are not a problem. I was interested specifically in these undefined symbols - perhaps it is a known issue or someone has encountered it before. The implicit question (which I should have made explicit) was: "Why does linking against OpenCL fail, even though I have OpenCL from NVIDIA installed?" Note that the library is in ELF format, yet neither nm nor objdump will extract the symbols from it; readelf does work, though (see below).

Peter, the library comes from the nvidia-libopencl1-331 package, which I have installed on my Ubuntu 14.04 LTS. I have checked with:

Code:

nvidia-libopencl1-304: /usr/lib/x86_64-linux-gnu/libOpenCL.so.1
nvidia-libopencl1-304: /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0
nvidia-libopencl1-304: /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0
nvidia-libopencl1-304-updates: /usr/lib/x86_64-linux-gnu/libOpenCL.so.1
nvidia-libopencl1-304-updates: /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0
nvidia-libopencl1-304-updates: /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0
nvidia-libopencl1-331: /usr/lib/x86_64-linux-gnu/libOpenCL.so.1
nvidia-libopencl1-331: /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0
nvidia-libopencl1-331: /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0
nvidia-libopencl1-331-updates: /usr/lib/x86_64-linux-gnu/libOpenCL.so.1
nvidia-libopencl1-331-updates: /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0
nvidia-libopencl1-331-updates: /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0
ocl-icd-libopencl1: /usr/lib/x86_64-linux-gnu/libOpenCL.so.1
ocl-icd-libopencl1: /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0

Looking at the library:

Code:

$ readelf -Ws /usr/lib/x86_64-linux-gnu/libOpenCL.so.1 | grep Device
    25: 0000000000002d40   146 FUNC    GLOBAL DEFAULT   10 clGetDeviceIDs
    31: 0000000000001e70    22 FUNC    GLOBAL DEFAULT   10 clGetDeviceInfo

$ readelf -Ws /usr/lib/x86_64-linux-gnu/libOpenCL.so.1 | grep clRetain
    41: 00000000000020e0    24 FUNC    GLOBAL DEFAULT   10 clRetainMemObject
    59: 0000000000002470    24 FUNC    GLOBAL DEFAULT   10 clRetainEvent
    74: 00000000000021b0    24 FUNC    GLOBAL DEFAULT   10 clRetainSampler
    77: 0000000000001ef0    22 FUNC    GLOBAL DEFAULT   10 clRetainContext
    95: 0000000000002280    24 FUNC    GLOBAL DEFAULT   10 clRetainProgram
    98: 0000000000001f80    22 FUNC    GLOBAL DEFAULT   10 clRetainCommandQueue
    99: 0000000000002380    24 FUNC    GLOBAL DEFAULT   10 clRetainKernel

I see that the symbols are indeed missing. Further investigation online suggests that clRetainDevice and clReleaseDevice are part of OpenCL 1.2 but are missing from OpenCL 1.1, which my package seems to provide (though this is not stated explicitly in the package description). Peter, can you confirm that OpenCL 1.2 is a requirement of OpenMM 6.1?
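The manual grep above can be turned into a small check that reports which required functions are absent from a `readelf -Ws` listing. This is a sketch only: the helper name `defined_functions` is invented here, and the sample text is a few lines copied from the output above.

```python
def defined_functions(readelf_output):
    """Collect the names of defined FUNC symbols from `readelf -Ws` output."""
    names = set()
    for line in readelf_output.splitlines():
        fields = line.split()
        # Symbol rows have the columns: Num, Value, Size, Type, Bind, Vis, Ndx, Name.
        if len(fields) >= 8 and fields[3] == "FUNC" and fields[6] != "UND":
            names.add(fields[7].split("@")[0])  # drop any @VERSION suffix
    return names


SAMPLE = """\
    25: 0000000000002d40   146 FUNC    GLOBAL DEFAULT   10 clGetDeviceIDs
    31: 0000000000001e70    22 FUNC    GLOBAL DEFAULT   10 clGetDeviceInfo
    77: 0000000000001ef0    22 FUNC    GLOBAL DEFAULT   10 clRetainContext
"""

required = {"clRetainDevice", "clReleaseDevice"}
print(sorted(required - defined_functions(SAMPLE)))
# prints ['clReleaseDevice', 'clRetainDevice']
```

In practice the text would come from running readelf on the library rather than from a hard-coded sample.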

I think it would be very useful to have the required versions of CUDA and OpenCL stated somewhere loud and clear - for example in the release message here on the forum, but possibly also in a readme file in the source distribution.

Ondrej Marsalek · Post by **Ondrej Marsalek** » Thu Oct 23, 2014 12:10 pm

Lee-Ping, that is exactly the case, and because of the potential pain that you mention, I try to stick to the stable version from the repo when possible. For other reasons, I have decided to update to 14.10 anyway, so I might get a sufficiently new version from the repo.

Peter Eastman · Post by **Peter Eastman** » Thu Oct 23, 2014 12:20 pm

No, it works fine with OpenCL 1.1. What is OPENCL_INCLUDE_DIR set to? All references to OpenCL 1.2 symbols are #ifdef'd based on CL_VERSION_1_2, and we never use any of them. But if you're compiling against a 1.2 header and a 1.1 library, that would cause the problem.

Peter
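The header/library mismatch described above can be checked from the header side by looking for the `CL_VERSION_x_y` macros that `cl.h` defines. The sketch below makes some assumptions: the function names are invented for illustration, the sample header text is a hand-written excerpt in the style of the Khronos header, and the library version has to be supplied by the user (for example after inspecting the library's symbols).

```python
import re


def header_cl_versions(header_text):
    """Return the (major, minor) OpenCL versions a cl.h-style header declares."""
    pattern = re.compile(r"^\s*#\s*define\s+CL_VERSION_(\d+)_(\d+)\b", re.MULTILINE)
    return sorted((int(a), int(b)) for a, b in pattern.findall(header_text))


def header_newer_than_library(header_text, library_version):
    """True if the header advertises a version above what the library provides."""
    versions = header_cl_versions(header_text)
    return bool(versions) and max(versions) > library_version


# Hand-written excerpt (assumption), modelled on the macros in cl.h.
SAMPLE_HEADER = """\
#define CL_VERSION_1_0 1
#define CL_VERSION_1_1 1
#define CL_VERSION_1_2 1
"""

print(header_cl_versions(SAMPLE_HEADER))                 # [(1, 0), (1, 1), (1, 2)]
print(header_newer_than_library(SAMPLE_HEADER, (1, 1)))  # True
```

A result of True for the header found under OPENCL_INCLUDE_DIR, combined with a 1.1 library, would be consistent with code guarded by CL_VERSION_1_2 being compiled in and then failing at link time.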
