GTX TITAN Superclocked-memtestG80-UNDERclocking in Linux ?

MemtestG80 and MemtestCL are software-based testers that check for "soft errors" in the memory or logic of GPUs supporting CUDA or OpenCL.
Marek Maly
Posts: 2
Joined: Mon May 27, 2013 10:08 am

Post by Marek Maly » Mon May 27, 2013 3:20 pm

Dear all,

I have recently bought two "EVGA GTX TITAN Superclocked" GPUs for scientific CUDA calculations
(MD in Amber12).

The first calculations (on both cards) with systems of around 60K atoms ran without any problems (NPT, Langevin), but when I later tried bigger systems (around 100K atoms) I got the "classic" irritating error

cudaMemcpy GpuBuffer::Download failed unspecified launch failure

after just a few thousand MD steps.

This was the reason for the memtestG80 tests.

I compiled memtestG80 from source ( memtestG80-1.1-src.tar.gz ) and then tested
just a small part of the GPU memory (200 MB) using 100 iterations.

On both cards I obtained a huge number of errors, but "just" in
"Random blocks:". There were 0 errors in all remaining tests in all iterations.

------THE LAST ITERATION AND FINAL RESULTS-------

Test iteration 100 (GPU 0, 200 MiB): 169736847 errors so far
Moving Inversions (ones and zeros): 0 errors (6 ms)
Memtest86 Walking 8-bit: 0 errors (53 ms)
True Walking zeros (8-bit): 0 errors (26 ms)
True Walking ones (8-bit): 0 errors (26 ms)
Moving Inversions (random): 0 errors (6 ms)
Memtest86 Walking zeros (32-bit): 0 errors (105 ms)
Memtest86 Walking ones (32-bit): 0 errors (104 ms)
Random blocks: 1369863 errors (27 ms)
Memtest86 Modulo-20: 0 errors (215 ms)
Logic (one iteration): 0 errors (4 ms)
Logic (4 iterations): 0 errors (8 ms)
Logic (shared memory, one iteration): 0 errors (8 ms)
Logic (shared-memory, 4 iterations): 0 errors (25 ms)

Final error count after 100 iterations over 200 MiB of GPU memory: 171106710 errors

------------------------------------------
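As a sanity check on the totals above, the per-test lines can be tallied with a short script. This is only a sketch: the regular expressions and the `tally` helper are my own assumptions about the log layout, based solely on the output pasted above, and the sample is shortened to a few of those lines.

```python
import re

# A shortened copy of the iteration-100 output pasted above.
LOG = """\
Test iteration 100 (GPU 0, 200 MiB): 169736847 errors so far
Moving Inversions (ones and zeros): 0 errors (6 ms)
Memtest86 Walking 8-bit: 0 errors (53 ms)
Random blocks: 1369863 errors (27 ms)
Memtest86 Modulo-20: 0 errors (215 ms)
Logic (shared-memory, 4 iterations): 0 errors (25 ms)
"""

# Assumed layout of a per-test line: "<name>: <N> errors (<T> ms)".
TEST_LINE = re.compile(r"^\s*(?P<name>.+?): (?P<errors>\d+) errors \((?P<ms>\d+) ms\)\s*$")
# Assumed layout of the iteration header: "... : <N> errors so far".
SO_FAR = re.compile(r"^Test iteration (\d+) .*: (\d+) errors so far")


def tally(log):
    """Return (errors before this iteration, {test name: errors in this iteration})."""
    so_far = 0
    per_test = {}
    for line in log.splitlines():
        match = TEST_LINE.match(line)
        if match:
            name = match.group("name")
            per_test[name] = per_test.get(name, 0) + int(match.group("errors"))
            continue
        match = SO_FAR.match(line)
        if match:
            so_far = int(match.group(2))
    return so_far, per_test


so_far, per_test = tally(LOG)
failing = [name for name, count in per_test.items() if count > 0]
total = so_far + sum(per_test.values())

print("tests with errors:", failing)
print("running total:", total)
```

With the numbers above, the "errors so far" value plus the errors of iteration 100 gives exactly the final count of 171106710, and "Random blocks" is the only test that contributes.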

I have some questions and would be really grateful for your comments.

#1
Is the current version of memtestG80 (1.1) able to provide correct/reliable test
results on GTX TITAN GPUs? This could be important for a possible
warranty claim, where my problems with Amber calculations might not be sufficient ...

For the moment I will assume that version 1.1 also provides reliable results on the GTX Titan.

#2
Do the facts that the errors appear exclusively in "Random blocks:" and that
I could simulate smaller systems without interruption (leaving aside the reliability
of such results) most probably point to an overclocking issue? Or do they clearly point
to another cause, e.g. that both cards are bad anyway and the only option is to replace them?


#3
Regarding overclocking: using deviceQuery I found out that under Linux both cards
automatically run at the boost shader/GPU frequency, which here is 928 MHz (the base value for these factory OC cards is 876 MHz). deviceQuery reports a Memory Clock rate of 3004 MHz although "it" should be 6008 MHz, but maybe the quantity deviceQuery reports as "Memory Clock rate" is different from the product specification "Memory Clock". It seems that "Memory Clock rate" = "Memory Clock"/2. Am I right? Or is deviceQuery simply unable to read this spec properly
on the Titan GPU?
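The numbers themselves are at least consistent with a factor of exactly two, which is what one would expect if the specification quotes a double-data-rate "effective" figure (that interpretation is an assumption, not something deviceQuery states). A minimal sketch of the arithmetic, using only the frequencies quoted in this post; the variable names and the `percent_over` helper are just for illustration:

```python
# Frequencies quoted in this post, all in MHz.
REPORTED_MEM_CLOCK = 3004  # deviceQuery "Memory Clock rate"
SPEC_MEM_CLOCK = 6008      # product specification "Memory Clock"
REFERENCE_CORE = 837       # TITAN reference frequency
FACTORY_BASE = 876         # base frequency of the factory OC card
BOOST = 928                # boost frequency seen under Linux


def percent_over(clock, baseline):
    """How far `clock` lies above `baseline`, in percent."""
    return 100.0 * (clock - baseline) / baseline


ratio = SPEC_MEM_CLOCK / REPORTED_MEM_CLOCK
print(f"spec / reported memory clock = {ratio}")
print(f"factory base over reference: {percent_over(FACTORY_BASE, REFERENCE_CORE):.1f} %")
print(f"boost over reference:        {percent_over(BOOST, REFERENCE_CORE):.1f} %")
print(f"boost over factory base:     {percent_over(BOOST, FACTORY_BASE):.1f} %")
```

So the boost frequency is roughly 11 % above the reference clock and roughly 6 % above the factory base clock.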

Anyway, for the moment I assume that the problem might be due to the high shader/GPU frequency.
To verify this hypothesis one should perhaps UNDERclock to the base frequency, which for this
model is 876 MHz, or even to the TITAN REFERENCE frequency, which is 837 MHz.

I am working with these cards under Linux (CentOS 2.6.32-358.6.1.el6.x86_64) and, as far as I could find,
the OC tools under Linux are in fact limited to the NVclock utility, which is unfortunately
out of date (at least as far as the GTX Titan is concerned). I got this output when I
just asked NVclock to read and print the shader and memory frequencies of my Titans:

-------------------------------------------------------------------

[root@dyn-138-272 NVCLOCK]# nvclock -s --speeds
Card: Unknown Nvidia card
Card number: 1
Memory clock: -2147483.750 MHz
GPU clock: -2147483.750 MHz

Card: Unknown Nvidia card
Card number: 2
Memory clock: -2147483.750 MHz
GPU clock: -2147483.750 MHz


-------------------------------------------------------------------
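The odd value -2147483.750 MHz does not look like a real reading. One plausible explanation (purely an assumption about NVclock's internals, not something taken from its documentation or source) is that a failed read produced the smallest 32-bit signed integer, interpreted as kHz, which was then divided by 1000 and stored as a single-precision float. A minimal sketch that reproduces the printed number under that assumption:

```python
import struct

# Assumption: the failed read yields INT32_MIN, interpreted as kHz.
INT32_MIN = -2**31

# Convert kHz to MHz, then round to single precision (float32),
# as a C program storing the result in a `float` would do.
as_mhz = INT32_MIN / 1000
as_float32 = struct.unpack("f", struct.pack("f", as_mhz))[0]

formatted = f"{as_float32:.3f}"
print(formatted, "MHz")
```

If that guess is right, the output only says that NVclock could not read the clocks of an unknown card at all, not that the card reports strange frequencies.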


I would be really grateful for tips on "NVclock alternatives",
but after several hours of googling it seems that there is no other Linux
tool with NVclock's functionality. So the only possibility here is perhaps to edit
the GPU BIOS with Linux/DOS/Windows tools (Kepler BIOS Tweaker, NVflash), but
I would rather avoid this approach, as it perhaps also
voids the warranty even though I am going to underclock the GPUs, not overclock them.
So before this possible step (GPU BIOS editing) I would like to have a rough estimate
of the probability that the problems really are caused by overclocking
(too high a default (boost) shader frequency).



#4
My HW/OS configuration

motherboard: ASUS P9X79 PRO
CPU: Intel Core i7-3930K
RAM: CRUCIAL Ballistix Sport 32GB (4x8GB) DDR3 1600 VLP
CASE: CoolerMaster Dominator CM-690 II Advanced
Power: Corsair AX1200
GPUs : 2 x EVGA GTX TITAN Superclocked 6GB
cooler: Cooler Master Hyper 412 SLIM

OS: CentOS (2.6.32-358.6.1.el6.x86_64)
driver version: 319.17
cudatoolkit_5.0.35_linux_64_rhel6.x

The computer is in an air-conditioned room with a constant ambient temperature of around 18°C.



Thanks a lot in advance for any comment !

Best wishes,

Marek

Marek Maly
Posts: 2
Joined: Mon May 27, 2013 10:08 am

Re: GTX TITAN Superclocked-memtestG80-UNDERclocking in Linux

Post by Marek Maly » Mon May 27, 2013 6:22 pm

Hi again,

Thanks to one valuable response in the EVGA forum ( http://www.evga.com/forums/tm.aspx?m=1940998 )
I finally learned that besides the "CLOSED" variants of memtestG80/memtestCL, which are
available here: https://simtk.org/project/xml/downloads ... oup_id=385 , there are also
"OPEN" variants, which seem to be more up to date and which differ from the "CLOSED" variants
in at least one thing: a sync error in the random blocks test was fixed :))
Here are the OPEN version source links:

memtestG80
https://github.com/ihaque/memtestG80
Here is the sync fix code:
https://github.com/ihaque/memtestG80/co ... b12341cc4c

memtestCL
https://github.com/ihaque/memtestCL
and the link to the fix code:
https://github.com/ihaque/memtestCL/com ... cd9e3bd5ff

When I tested my factory OC TITANs with the patched (OPEN) version of memtestG80 I obtained
0 errors !!!

So far (just a few minutes ago) I have tested 5 GB of memory using 300 iterations
( ./memtestG80 -g 1 5000 300 ) with zero errors (on both GPUs).
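To repeat the same run on both cards without retyping it, the command line can be generated per GPU. This is a small sketch only: the helper name `memtest_command` and the default binary path are hypothetical, the argument order simply mirrors the command above, and I assume that `-g` takes the GPU index (0 and 1 on this machine).

```python
def memtest_command(gpu_index, megabytes, iterations, binary="./memtestG80"):
    """Build the argument list: <binary> -g <gpu> <MB to test> <iterations>."""
    return [binary, "-g", str(gpu_index), str(megabytes), str(iterations)]


# One command per GPU, with the same size and iteration count as above.
commands = [memtest_command(gpu, 5000, 300) for gpu in (0, 1)]

for command in commands:
    print(" ".join(command))
    # To actually launch the test, something like this could be used:
    #   import subprocess
    #   subprocess.run(command, check=False)
```

The launch itself is left commented out so that the sketch only prints the commands.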

So it seems that my original problem with that particular MD calculation, which prompted me to test my new OC Titan cards with memtestG80, does not originate in GPU hard/soft errors.
