摘要 |
A method and apparatus is described for parallel debugging on the data nodes of a parallel computer system. A data template associated with the debugger can be used as a reference to the common data on the nodes. The application or data contained on the compute nodes diverges from the data template at the service node during the course of program execution, so that pieces of the data are different at each of the nodes at some time of interest. For debugging, the compute nodes search their own memory image for checksum matches with the template and produces new data blocks with checksums that didn't exist in the data template, and a template of references to the original data blocks in the template. Examples herein include an application of the rsync protocol, compression and network broadcast to improve debugging in a massively parallel computer environment.
|