Welcome to Open-Discussion

MemtestG80 and MemtestCL are software-based testers that check for "soft errors" in the memory or logic of GPUs supporting CUDA or OpenCL.
SimTK.org Admin
Posts: 0
Joined: Wed Dec 31, 1969 5:00 pm

Welcome to Open-Discussion

Post by SimTK.org Admin » Sun Apr 05, 2009 4:21 pm

Welcome to Open-Discussion

Marek Maly
Posts: 2
Joined: Wed Aug 15, 2012 2:08 pm

Re: Welcome to Open-Discussion

Post by Marek Maly » Wed Aug 15, 2012 5:02 pm

Hello all,

I bought two new GPUs (Gigabyte GeForce GTX580, 3072 MB) and installed
them in my new "Bulldozer" machine:

---------------------------------------------------------------
motherboard: GIGABYTE 990FXA-UD7
CPU: AMD FX-8150
Power: Enermax MODU87+ 800W Lot6 Gold
OS: CentOS

--------------------------------
CUDA Toolkit and NVIDIA driver installed:

cudatoolkit_4.2.9_linux_64_rhel6.0
devdriver_4.2_linux_64_295.41
---------------------------------------------------------------

As I got some unexpected errors during my first MD runs (using Amber), I tried to "google" some solutions and found the SimTK page.

I downloaded and compiled memtestG80-1.1 and tested both GPUs.
I tested each GPU with 1000 MB and with 2000 MB of memory, 100 iterations each.
I repeated both tests three times. Here are the results:


GPU 0
                         FIRST RUN       SECOND RUN   THIRD RUN
1000 MB, 100 iterations: 0 ERRORS        0 ERRORS     0 ERRORS
2000 MB, 100 iterations: 312 ERRORS      0 ERRORS     11352 ERRORS


GPU 1
                         FIRST RUN       SECOND RUN   THIRD RUN
1000 MB, 100 iterations: 137896 ERRORS   0 ERRORS     0 ERRORS
2000 MB, 100 iterations: 0 ERRORS        0 ERRORS     0 ERRORS


Errors appeared only in the Memtest86 Modulo-20, Random blocks, and Moving Inversions (random)
subtests, mainly in the first two.

My questions are:


#0
Why do the results differ so much between runs of the same test (same settings, i.e. memory/iterations) on a given GPU? Is it, for example, because the tests are partially stochastic (i.e. they use random numbers), so that different parts of memory are tested in each run?

#1
How critical are the results reported above if I intend to use these GPUs for MD calculations?

#2
What might be the cause of these errors (a bad GPU, bad communication between the GPU and the motherboard, ...)?

#3
What can I try in order to eliminate these errors, or at least reduce their number? Downclocking the GPUs? Changing some BIOS settings? Installing some firmware ...?

#4
If the results reported above are really critical and there is no way to sufficiently eliminate the errors, are these errors sufficient grounds for a warranty claim on the affected GPU(s)?

Thank you very much in advance for any comments!

Best wishes,

Marek







