Interpreting memtestG80 results
Posted: Fri Jan 04, 2013 2:11 pm
I'm running memtestG80 and memtestCL on some M2075 GPUs.
One particular GPU is giving me trouble: it hangs memtestCL, usually in fewer than 2k iterations.
On the same GPU, memtestG80 reports abnormally large error counts (see below).
I think those numbers are compounded by the running total overflowing a 32-bit error counter.
My question is whether these results might be due to a driver or software bug rather than some kind of hardware error.
The nvidia-smi tool does not report any ECC errors on this device (yes, I have ECC enabled and I'm running a memory checker).
I'm running the latest commit (c4336a69fff07945c322d6c7fc40b0b12341cc4c) from https://github.com/ihaque/memtestG80
Any suggestions?
(Is there a better forum for this question? There doesn't seem to be much activity here.)
Thanks
Test iteration 10 (GPU 5, 3500 MiB): 0 errors so far
Moving Inversions (ones and zeros): 0 errors (188 ms)
Memtest86 Walking 8-bit: 0 errors (1504 ms)
True Walking zeros (8-bit): 0 errors (785 ms)
True Walking ones (8-bit): 0 errors (786 ms)
Moving Inversions (random): 0 errors (188 ms)
Memtest86 Walking zeros (32-bit): 4294967278 errors (137304 ms)
Memtest86 Walking ones (32-bit): 4294967232 errors (480034 ms)
Random blocks: 4294967294 errors (15001 ms)
Memtest86 Modulo-20: 4294967256 errors (300021 ms)
Logic (one iteration): 4294967294 errors (15001 ms)
Logic (4 iterations): 4294967294 errors (15001 ms)
Logic (shared memory, one iteration): 4294967294 errors (15001 ms)
Logic (shared-memory, 4 iterations): 4294967294 errors (15001 ms)
Test iteration 11 (GPU 5, 3500 MiB): 4294967164 errors so far
Moving Inversions (ones and zeros): 4294967294 errors (15001 ms)
^C
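As a quick sanity check of the 32-bit overflow idea, here is a minimal Python sketch that reinterprets the per-test counts from iteration 10 as signed 32-bit integers and re-adds them modulo 2^32. The helper name `to_signed32` is made up for illustration, and I am only assuming the counts are kept in 32-bit integers; I have not confirmed that in the memtestG80 source.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(value):
    """Reinterpret an unsigned 32-bit value as a two's-complement signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value

# Per-test counts from iteration 10, copied from the log above, in order.
reported = [
    0, 0, 0, 0, 0,   # first five tests: clean
    4294967278,      # Memtest86 Walking zeros (32-bit)
    4294967232,      # Memtest86 Walking ones (32-bit)
    4294967294,      # Random blocks
    4294967256,      # Memtest86 Modulo-20
    4294967294,      # Logic (one iteration)
    4294967294,      # Logic (4 iterations)
    4294967294,      # Logic (shared memory, one iteration)
    4294967294,      # Logic (shared-memory, 4 iterations)
]

signed = [to_signed32(v) for v in reported]

# Summing with 32-bit wraparound, as a 32-bit counter would.
running_total = sum(reported) & MASK32

print(signed)                       # non-zero entries: -18, -64, -2, -40, -2, -2, -2, -2
print(running_total)                # 4294967164
print(to_signed32(running_total))   # -132
```

The wrapped sum comes out to 4294967164, which matches the "errors so far" figure printed at the start of iteration 11, so the total does look like a 32-bit wraparound. Read as signed values, the individual counts are small negative numbers (-2, -18, -40, -64) rather than billions of errors. That might mean they are negative status codes being added into the error count instead of real bit errors, but that part is a guess on my side.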