Title: Accelerated interlane vector reduction instructions
Abstract: A vector reduction instruction is executed by a processor to provide efficient reduction operations on an array of data elements. The processor includes vector registers. Each vector register is divided into a plurality of lanes, and each lane stores the same number of data elements. The processor also includes execution circuitry that receives the vector reduction instruction to reduce the array of data elements stored in a source operand into a result in a destination operand using a reduction operator. Each of the source operand and the destination operand is one of the vector registers. Responsive to the vector reduction instruction, the execution circuitry applies the reduction operator to two of the data elements in each lane, and shifts one or more remaining data elements when there is at least one of the data elements remaining in each lane.
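The per-lane behavior described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the patented circuitry: the name `reduce_step`, the list-of-lists register model, the choice of the two lowest-indexed elements as the pair to combine, and the shrinking of each lane (rather than any particular fill value in the vacated position) are all hypothetical choices made for illustration.

```python
from operator import add


def reduce_step(lanes, op):
    """Model one execution of a (hypothetical) vector reduction step.

    `lanes` is a list of lists, one inner list per lane. Index 0 is
    treated as the lowest-order element of the lane, which is an
    assumption; the abstract does not fix an element order.
    """
    result = []
    for lane in lanes:
        if len(lane) < 2:
            # Fewer than two elements: nothing to reduce in this lane.
            result.append(list(lane))
            continue
        combined = op(lane[0], lane[1])  # reduce two elements into one
        # The remaining elements shift down by one position.
        result.append([combined] + list(lane[2:]))
    return result


# Two lanes with the same number of elements, as the abstract requires.
lanes = [[1, 2, 3, 4], [10, 20, 30, 40]]
step1 = reduce_step(lanes, add)
```

In this model, each call shortens every lane by one element, so a lane holding n elements is fully reduced after n - 1 calls.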
Publication number: US9588766(B2)  Publication date: 2017.03.07
Application number: US201213630154  Filing date: 2012.09.28
Applicant: Intel Corporation  Inventors: Caprioli Paul; Kanhere Abhay S.; Cook Jeffrey J.; Al-Otoom Muawya M.
Classification: G06F12/00; G06F7/38; G06F9/00; G06F9/44; G06F9/30  Main classification: G06F12/00
Agency: Nicholson De Vos Webster & Elliott LLP  Agent: Nicholson De Vos Webster & Elliott LLP
Main claim: 1. An apparatus comprising: a plurality of vector registers, wherein each vector register is divided into a plurality of lanes, and each lane stores a same number of data elements; and execution circuitry coupled to the plurality of vector registers, the execution circuitry to: receive a vector reduction instruction to reduce an array of the data elements stored in a source operand into a result in a destination operand using a reduction operator, wherein each of the source operand and the destination operand is one of the vector registers, responsive to the vector reduction instruction, apply the reduction operator to two of the data elements in each lane, reduce the two data elements into one data element, and shift one or more remaining data elements when there is at least one of the data elements remaining in each lane; wherein the execution circuitry is to convert reduction code without the vector reduction instruction into translated reduction code with the vector reduction instruction, wherein the reduction code and the translated reduction code specify a same sequence of reduction operations applied to the array of data elements across the plurality of lanes and generate a same result.
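The final limitation of the claim, that the original and translated reduction code apply the same sequence of operations and produce the same result, can be illustrated in software. The sketch below is an assumption-laden model rather than the claimed conversion mechanism: `scalar_lane_reduction` and `vector_lane_reduction` are hypothetical names, lanes are modeled as Python lists, and the pair combined at each step is assumed to be the two lowest-indexed elements. Subtraction is used in the example because it is not associative, so the two versions agree only if they apply the operator in the same order.

```python
from functools import reduce
from operator import sub


def scalar_lane_reduction(lanes, op):
    # Stand-in for reduction code "without" the instruction:
    # a plain left-to-right fold over each lane.
    return [reduce(op, lane) for lane in lanes]


def vector_lane_reduction(lanes, op):
    # Stand-in for the translated code: repeat a modeled reduction step
    # (combine two elements, shift the rest) until one element remains
    # in each lane.
    lanes = [list(lane) for lane in lanes]
    while any(len(lane) > 1 for lane in lanes):
        lanes = [
            [op(lane[0], lane[1])] + lane[2:] if len(lane) > 1 else lane
            for lane in lanes
        ]
    return [lane[0] for lane in lanes]


data = [[10, 3, 2, 1], [8, 4, 2, 1]]
scalar_result = scalar_lane_reduction(data, sub)
vector_result = vector_lane_reduction(data, sub)
```

Both functions evaluate ((a0 - a1) - a2) - a3 within each lane, so the results match even for an order-sensitive operator.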
Address: Santa Clara CA US