Interpreting memtestG80 results

Posted: Fri Jan 04, 2013 2:11 pm
by dchev
I'm running memtestG80/CL on some M2075 cards.
One particular GPU is giving me trouble: it usually hangs memtestCL in fewer than 2k iterations.
memtestG80 reports some abnormally large error counts (see below).
I think those counts are compounded by summation overflow in a 32-bit error counter.
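
A quick sanity check of that theory (a minimal sketch; I'm assuming the per-test counters are plain 32-bit unsigned words and that an aborted test contributes a small negative value, which I haven't verified in the source):

#include <cstdio>

int main() {
    int err = -2;                    // hypothetical per-test failure code
    unsigned count = (unsigned)err;  // same bits reinterpreted as an unsigned 32-bit counter
    std::printf("%u\n", count);      // prints 4294967294 (= 2^32 - 2), the value in the log below
    return 0;
}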

My question is whether these results might be due to, say, a driver/software bug rather than some kind of hardware error.
The nvidia-smi tool does not report any ECC errors on this device (yes, I have ECC enabled and I'm running a memory checker).
I'm running the latest commit (c4336a69fff07945c322d6c7fc40b0b12341cc4c) from https://github.com/ihaque/memtestG80
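(For what it's worth, the exact query I ran was "nvidia-smi -q -d ECC", and every counter it prints reads zero.)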

Any suggestions?
(Is there a better forum for this question? There doesn't seem to be much activity here.)
Thanks

Test iteration 10 (GPU 5, 3500 MiB): 0 errors so far
Moving Inversions (ones and zeros): 0 errors (188 ms)
Memtest86 Walking 8-bit: 0 errors (1504 ms)
True Walking zeros (8-bit): 0 errors (785 ms)
True Walking ones (8-bit): 0 errors (786 ms)
Moving Inversions (random): 0 errors (188 ms)
Memtest86 Walking zeros (32-bit): 4294967278 errors (137304 ms)
Memtest86 Walking ones (32-bit): 4294967232 errors (480034 ms)
Random blocks: 4294967294 errors (15001 ms)
Memtest86 Modulo-20: 4294967256 errors (300021 ms)
Logic (one iteration): 4294967294 errors (15001 ms)
Logic (4 iterations): 4294967294 errors (15001 ms)
Logic (shared memory, one iteration): 4294967294 errors (15001 ms)
Logic (shared-memory, 4 iterations): 4294967294 errors (15001 ms)

Test iteration 11 (GPU 5, 3500 MiB): 4294967164 errors so far
Moving Inversions (ones and zeros): 4294967294 errors (15001 ms)
^C

Re: Interpreting memtestG80 results

Posted: Mon May 27, 2013 9:35 am
by marmal
Dear all,

I recently bought two "EVGA GTX TITAN Superclocked" GPUs for scientific CUDA calculations
(MD in Amber12).

My first calculations (on both cards), with systems of around 60K atoms, ran without any problems (NPT, Langevin), but when I later tried bigger systems (around 100K atoms) I got the "classic" irritating error

cudaMemcpy GpuBuffer::Download failed unspecified launch failure

after just a few thousand MD steps.

This was obviously the reason to run the memtestG80 tests.

So I compiled memtestG80 from source (memtestG80-1.1-src.tar.gz) and then tested
just a small part of the GPU memory (200 MiB) over 100 iterations.
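(For reference, the invocation was along the lines of "./memtestG80 --gpu 0 200 100", i.e. the amount of memory in MiB followed by the number of test iterations; the flag name is my reading of the usage message, so double-check against your build.)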

On both cards I obtained a huge number of errors, but "just" in
"Random blocks:"; all the remaining tests reported 0 errors in every iteration.

------THE LAST ITERATION AND FINAL RESULTS-------

Test iteration 100 (GPU 0, 200 MiB): 169736847 errors so far
Moving Inversions (ones and zeros): 0 errors (6 ms)
Memtest86 Walking 8-bit: 0 errors (53 ms)
True Walking zeros (8-bit): 0 errors (26 ms)
True Walking ones (8-bit): 0 errors (26 ms)
Moving Inversions (random): 0 errors (6 ms)
Memtest86 Walking zeros (32-bit): 0 errors (105 ms)
Memtest86 Walking ones (32-bit): 0 errors (104 ms)
Random blocks: 1369863 errors (27 ms)
Memtest86 Modulo-20: 0 errors (215 ms)
Logic (one iteration): 0 errors (4 ms)
Logic (4 iterations): 0 errors (8 ms)
Logic (shared memory, one iteration): 0 errors (8 ms)
Logic (shared-memory, 4 iterations): 0 errors (25 ms)

Final error count after 100 iterations over 200 MiB of GPU memory: 171106710 errors

------------------------------------------
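
(Note that the running total is at least self-consistent: 169,736,847 errors entering iteration 100, plus the 1,369,863 from this iteration's "Random blocks" test, gives exactly the final 171,106,710.)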

I have some questions and would be really grateful for your comments.

#1
Is the current version of memtestG80 (1.1) able to provide correct/reliable test
results on GTX TITAN GPUs? This could be important for a possible
warranty claim, where my problems with the Amber calculations alone might not be sufficient evidence ...

For the moment I will assume that version 1.1 provides reliable results on the GTX Titan as well.

#2
Are the facts that the errors appear exclusively in "Random blocks:", and that I could
run smaller systems without interruption (leaving aside the reliability of those results),
symptomatic of what is most probably an overclocking issue? Or do they clearly point
to another cause, e.g. that both cards are simply bad and the only solution is to replace them?


#3
Regarding overclocking: using deviceQuery I found that under Linux both cards automatically run at the boost shader/GPU frequency, which here is 928 MHz (the base value for these factory-overclocked cards is 876 MHz). deviceQuery reports a "Memory Clock rate" of 3004 MHz, although "it" should be 6008 MHz; but maybe the quantity deviceQuery reports as "Memory Clock rate" is something different from the product specification's "Memory Clock". It seems that "Memory Clock rate" = "Memory Clock"/2. Am I right? Or is deviceQuery simply unable to read this specification properly on a Titan GPU?
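
For what it's worth, a minimal sketch of reading that figure programmatically with the standard CUDA runtime API (the doubling is my assumption, chosen only because 2 x 3004 = 6008 matches the spec sheet):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;  // device 0
    double reported = prop.memoryClockRate / 1000.0;  // the field is given in kHz
    std::printf("reported: %.0f MHz, doubled: %.0f MHz\n", reported, 2.0 * reported);
    return 0;
}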

Anyway, for the moment I assume the problem might be due to the high shader/GPU frequency.
To verify this hypothesis, one should perhaps UNDERclock to the base frequency, which on this
model is 876 MHz, or even to the Titan reference frequency of 837 MHz.

Obviously I am working with these cards under Linux (CentOS, kernel 2.6.32-358.6.1.el6.x86_64), and as far as
I can tell the overclocking tools under Linux are in fact limited to the NVclock utility, which is
unfortunately out of date (at least as far as the GTX Titan is concerned). This is the output I got
when I simply asked NVclock to read and print the shader and memory frequencies of my Titans:

-------------------------------------------------------------------

[root@dyn-138-272 NVCLOCK]# nvclock -s --speeds
Card: Unknown Nvidia card
Card number: 1
Memory clock: -2147483.750 MHz
GPU clock: -2147483.750 MHz

Card: Unknown Nvidia card
Card number: 2
Memory clock: -2147483.750 MHz
GPU clock: -2147483.750 MHz


-------------------------------------------------------------------


I would be really grateful for tips on NVclock alternatives,
but after wasting some hours on Google it seems there is no other Linux
tool with NVclock's functionality. So the only remaining option is perhaps to edit the
GPU BIOS with tools like Kepler BIOS Tweaker or NVflash (Linux/DOS/Windows), but
I would rather avoid that approach, since using it may also
void the warranty, even though I intend to underclock the GPUs, not overclock them.
So before taking that step (GPU BIOS editing) I would like a rough estimate
of the probability that the problems really are due to overclocking
(a too-high default boost shader frequency).



#4
My HW/OS configuration

motherboard: ASUS P9X79 PRO
CPU: Intel Core i7-3930K
RAM: Crucial Ballistix Sport 32GB (4x8GB) DDR3 1600 VLP
case: CoolerMaster Dominator CM-690 II Advanced
power supply: Corsair AX1200
GPUs: 2 x EVGA GTX TITAN Superclocked 6GB
cooler: Cooler Master Hyper 412 SLIM

OS: CentOS (2.6.32-358.6.1.el6.x86_64)
driver version: 319.17
cudatoolkit_5.0.35_linux_64_rhel6.x

The computer is in an air-conditioned room with a constant ambient temperature of around 18°C.



Thanks a lot in advance for any comment !

Best wishes,

Marek