摘要 |
Improved transposition of a matrix in a computer system may be accomplished while utilizing at most a single permutation vector. This greatly improves the speed and parallelability of the transpose operation. For a standard rectangular matrix having M rows and N columns and a size MxN, first n and q are determined, wherein N=n*q, and wherein Mxq represents a block size and wherein N is evenly divisible by p. Then, the matrix is partitioned into n columns of size Mxq. Then for each column n, elements are sequentially read within the column row-wise and sequentially written into a cache, then sequentially read from the cache and sequentially written row-wise back into the matrix in a memory in a column of size qxM. A permutation vector may then be applied to the matrix to arrive at the transpose. This method may be modified for special cases, such as square matrices, to further improve efficiency.
|