Title: SIMD IMPLEMENTATION OF STENCIL CODES
Abstract: Implementing a 1D stencil code via SIMD instructions on a computer with vector registers having N processing elements (PEs), among them a set of coefficient vector registers, a set of at most N data vector registers, and a set of result vector registers. The M stencil coefficients are loaded in a particular pattern into M+N−1 coefficient vector registers. Successive sets of N consecutive data values are received, and each data value of a set is loaded into all PEs of a data vector register of the set of data vector registers. The result vector registers accumulate sums of products of consecutive coefficient vector registers with corresponding data vector registers. The contents of any result vector register containing a sum of all coefficient vector register-data vector register products are output, and the result vector register is reused for accumulating.
Publication number: US2016283441(A1)  Publication date: 2016.09.29
Application number: US201514670684  Filing date: 2015.03.27
Applicant: International Business Machines Corporation  Inventors: Grinberg Leopold; Magerlein Karen A.
Classification: G06F17/16; G06F9/30; G06F15/80  Main classification: G06F17/16
Agency:  Agent:
Main claim: 1. A method for implementing a 1D stencil code via SIMD instructions on a computer that includes a plurality of vector registers, each having a number N of processing elements (PEs), the plurality of vector registers including a set of vector registers used to store stencil coefficients, a set of vector registers used to store data values, and a set of vector registers used to store results, the method comprising: receiving, by a processor, a number M of ordered stencil coefficients; loading, by the processor, stencil coefficients into M+N−1 consecutive coefficient vector registers of the set of coefficient vector registers, such that: the first PEs of the first M coefficient vector registers contain the M stencil coefficients in order, and the remaining N−1 PEs of the first coefficient vector register and the first PEs of the N−1 coefficient vector registers following the first M coefficient vector registers contain zeros; the subsequent PEs of the M+N−2 coefficient vector registers following the first coefficient vector register contain the value stored in the previous PE of the previous coefficient vector register; receiving, by the processor, at least M+N−1 data values; processing consecutive sets of N consecutive data values of the received data values by: loading, by the processor, each data value of the set of N consecutive data values into all PEs of a data vector register of the set of data vector registers; accumulating, by the processor, in a result vector register of the set of result vector registers, a sum of products of consecutive coefficient vector registers with data vector registers containing the consecutive data values; and outputting, by the processor, the contents of each result vector register containing a sum of the products of all the coefficient vector registers with a data vector register, wherein each result vector register containing a sum of all the products of all the coefficient vector registers with a corresponding data vector register is then reused for accumulating.
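Illustrative sketch (not part of the patent text): the register layout and accumulation recited in the claim can be simulated in plain Python, with each vector register modeled as a list of N values. The function names, the use of lists in place of hardware registers, the handling of a trailing partial set of data values, and the output convention y[j] = c[0]*x[j] + ... + c[M-1]*x[j+M-1] are all assumptions made for this example; the record itself does not specify them.

```python
def build_coefficient_registers(coeffs, n):
    """Lay out M coefficients across M+N-1 simulated registers of N PEs each."""
    m = len(coeffs)
    regs = [[0.0] * n for _ in range(m + n - 1)]
    # First PE of the first M registers holds the coefficients in order;
    # everything else starts as zero.
    for k in range(m):
        regs[k][0] = coeffs[k]
    # Each later PE copies the previous PE of the previous register,
    # which shifts the coefficients diagonally through the register file.
    for k in range(1, m + n - 1):
        for p in range(1, n):
            regs[k][p] = regs[k - 1][p - 1]
    return regs


def stencil_1d_simd(coeffs, data, n):
    """Simulate the accumulate-and-reuse scheme; returns complete outputs only."""
    coef = build_coefficient_registers(coeffs, n)
    total = len(coef)            # M+N-1 products per result register
    active = []                  # (start index, accumulator) pairs in flight
    out = []
    for base in range(0, len(data), n):
        # Broadcast each data value of the current set into all PEs.
        data_regs = [[x] * n for x in data[base:base + n]]
        # A fresh (or reused) result register starts with every set.
        active.append((base, [0.0] * n))
        pending = []
        for start, acc in active:
            finished = False
            for j, dreg in enumerate(data_regs):
                k = base + j - start   # which coefficient register pairs up
                if k < total:
                    for p in range(n):
                        acc[p] += coef[k][p] * dreg[p]
                    finished = finished or k == total - 1
            if finished:
                out.extend(acc)        # output N results, register is free again
            else:
                pending.append((start, acc))
        active = pending
    return out


def stencil_1d_reference(coeffs, data, count):
    """Direct scalar evaluation, used only to check the simulation."""
    return [sum(c * data[j + i] for i, c in enumerate(coeffs))
            for j in range(count)]
```

Under these assumptions, PE p of a completed result register holds output start+p, so each register yields N consecutive stencil results after M+N-1 multiply-accumulate steps. On real hardware the inner loop over p would be a single vector multiply-add; the simulation only makes the data movement explicit.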
Address: Armonk NY US