发明名称 Efficient partitioning techniques for massively distributed computation
摘要 A repartitioning optimizer identifies alternative repartitioning strategies and selects optimal ones, accounting for network transfer utilization and partition sizes in addition to traditional metrics. If prior partitioning was hash-based, the repartitioning optimizer can determine whether a hash-based repartitioning can result in not every computing device providing data to every other computing device. If prior partitioning was range-based, the repartitioning optimizer can determine whether a range-based repartitioning can generate similarly sized output partitions while aligning input and output partition boundaries, increasing the number of computing devices that do not provide data to every other computing device. Individual computing devices, as they are performing a repartitioning, assign a repartitioning index to each individual data element, which represents the computing device to which such a data element is destined. The indexed data is sorted by such repartitioning indices, thereby grouping together all like data, and then stored in a sequential manner.
申请公布号 US8996464(B2) 申请公布日期 2015.03.31
申请号 US201213494006 申请日期 2012.06.11
申请人 Microsoft Technology Licensing, LLC 发明人 Zhou Jingren;Bruno Nicolas;Lin Wei
分类号 G06F17/30;G06F9/50 主分类号 G06F17/30
代理机构 代理人 Tabor Ben;Drakos Kate;Minhas Micky
主权项 1. A method for repartitioning a data set among multiple computing devices, the method comprising the steps of: receiving, at a computing device, a query directed to the data set, which is currently partitioned into a first quantity of partitions distributed among a first set of computing devices; selecting, if the current partitioning was based upon hash values of individual data entries of the data set, a repartitioning strategy at the computing device to repartition the data set among a second set of computing devices, the selected repartitioning strategy comprising: hashing values of individual data entries of the data set with a same hash function as was utilized to effectuate the current partitioning; and selecting a second quantity of partitions into which the data set is to be repartitioned to have a common positive factor greater than one with the first quantity of partitions, thereby expressly providing that at least some computing devices of the first set of computing devices do not provide data to at least some computing devices of the second set of computing devices; selecting, if the current partitioning was based upon ranges of values of a first collection of one or more data elements of the data entries of the data set, the repartitioning strategy at the computing device to repartition the data set among the second set of computing devices, the selected repartitioning strategy comprising: selecting a second collection of one or more data elements of the data entries of the data set to include at least one data element from the first collection; and selecting ranges of values of the second collection of one or more data elements of the data entries of the data set such that repartitioning the data set in accordance with the selected ranges of values results in at least some computing devices of the first set of computing devices not providing data to at least some computing devices of the second set of computing devices; and instructing, by the computing device, the computing devices of the first set of computing devices to repartition the data set in accordance with the selected repartitioning strategy.
地址 Redmond WA US