Dear Dave,
Thanks again for your help. I continued testing svpost, and we may be looking in the wrong direction. The fact that -sn -ph works with time step 2300 but not with 2600 makes me think that something got corrupted in between. Furthermore, I tried converting the second cardiac cycle instead of the third (1000->2000 instead of 2000->3000) and everything worked, which suggests that memory is not the issue.
These simulations were run on a cluster with a limited wall time and were therefore restarted at some point, which actually matches the time step where svpost crashes. I checked whether any restart.*.* files are missing or have different permissions, and whether there is a jump in the Q and P histories, to see if the restart was done properly, but everything is smooth and continuous. Do you have any hints in this direction?
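The missing-file check described above can be scripted. The following is a minimal sketch, assuming only the restart.<step>.<proc> naming used in this thread; the function name find_incomplete_steps is hypothetical and not part of svSolver.

```python
import re
from collections import defaultdict

# Matches names such as restart.2600.106 (time step, then processor id).
RESTART_RE = re.compile(r"^restart\.(\d+)\.(\d+)$")


def find_incomplete_steps(filenames):
    """Return {step: [missing processor ids]} for steps lacking some files.

    The expected set of processor ids is taken to be the union of all ids
    seen across every step, which is an assumption of this sketch.
    """
    procs_by_step = defaultdict(set)
    for name in filenames:
        match = RESTART_RE.match(name)
        if match:
            step, proc = int(match.group(1)), int(match.group(2))
            procs_by_step[step].add(proc)

    all_procs = set()
    for procs in procs_by_step.values():
        all_procs |= procs

    return {
        step: sorted(all_procs - procs)
        for step, procs in sorted(procs_by_step.items())
        if procs != all_procs
    }
```

It can be fed the output of os.listdir() on the directory that holds the restart files. An empty result only means that no file is missing; it says nothing about the contents of the files.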
Cheers,
Rodrigo
Issue post-processing - svpost
- Rodrigo Romarowski
- Posts: 16
- Joined: Sat Apr 15, 2017 10:06 am
- David Parker
- Posts: 1719
- Joined: Tue Aug 23, 2005 2:43 pm
Re: Issue post-processing - svpost
Hi Rodrigo,
Good observation! Looking at the logs you sent me, it seems that some of the files for steps 2400 and 2600 are corrupted.
You could just run svpost on the restart files that are good or even rerun the simulation for the missing time steps.
svpost seems to handle some cases of corrupted files (e.g. truncated files) but segfaults if the restart file header is corrupted. I've added a bit more error handling to svpost to check for that; you can get the latest version from https://github.com/ktbolt/svSolver/tree/error-checking.
Could you also put the restart.2600.106 file someplace I can download it? I'd like to see how it is corrupted so I can make sure to check for that case.
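To narrow down which restart files are worth inspecting, one rough option is to compare file sizes. The sketch below assumes that a given processor writes restart files of roughly the same size at every step, which is an assumption and not something svSolver guarantees; the function name and the 1% tolerance are hypothetical. It can only catch truncated or empty files and will not detect a corrupted header in a file of normal size.

```python
import os
import re
from collections import defaultdict
from statistics import median_low

# Matches names such as restart.2600.106 (time step, then processor id).
RESTART_RE = re.compile(r"^restart\.(\d+)\.(\d+)$")


def find_undersized_restarts(directory, tolerance=0.99):
    """Return sorted (step, proc, size, typical_size) tuples for small files.

    A file is flagged when it is empty or smaller than tolerance times the
    median size of the restart files written by the same processor.
    """
    sizes = defaultdict(dict)  # proc -> {step: size in bytes}
    for name in os.listdir(directory):
        match = RESTART_RE.match(name)
        if match:
            step, proc = int(match.group(1)), int(match.group(2))
            sizes[proc][step] = os.path.getsize(os.path.join(directory, name))

    suspects = []
    for proc, by_step in sizes.items():
        # median_low always returns one of the observed sizes.
        typical = median_low(by_step.values())
        for step, size in by_step.items():
            if size == 0 or size < tolerance * typical:
                suspects.append((step, proc, size, typical))
    return sorted(suspects)
```

Any file it reports is only a candidate; confirming the corruption still requires looking at the file or at the svpost log.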
Cheers,
Dave
- Rodrigo Romarowski
- Posts: 16
- Joined: Sat Apr 15, 2017 10:06 am
Re: Issue post-processing - svpost
Dear Dave,
I have compiled the new code, and the simulation I was using did reduce the results correctly, meaning that the file that gave the segmentation fault was probably not corrupted.
However, some other patients (simulations) still have the same problem. Unfortunately, the seemingly random behavior of the error makes it hard to detect where the problem is. We cannot run svpost on only one part of the cycle, since the parameters we're looking for would be incomplete. Furthermore, running the entire simulation at once would be impossible due to the wall time restrictions of the cluster and the limited scalability of the solver for meshes of these sizes.
I attach the log file for another simulation using the error-checking branch, as well as restart.2660.130, which is the file that is causing trouble in this case. If there is a way I can send you the whole procs-case folder, which is about 20 GB, it would be easier to troubleshoot.
Best,
Rodrigo
- Attachments
- log_and_restart.zip
- (664.92 KiB) Downloaded 21 times