Title: Storage configuration in data warehouses
Abstract: Techniques are described for employing a graph-based analysis to determine a configuration of datasets to be stored on data storage systems in a data warehouse environment. Associations between datasets may be determined based on the parsing of join statements or other types of statements in jobs that are executed on the data storage systems. A graph may be generated that describes the associations among datasets. A greedy breadth-first traversal of the graph may be performed to determine sets of associated datasets. A utilization metric describing a weight of storing the datasets may be determined and employed to identify a data storage system on which to store a set of associated datasets, given the storage and processing capacity of the data storage system.
Publication No.: US9563687(B1) Publication Date: 2017.02.07
Application No.: US201414540648 Filing Date: 2014.11.13
Applicant: Amazon Technologies, Inc. Inventors: Dutta Arnab; Muthiah Ramanathan; Rajagopalan Srinivasan V.
Classification: G06F17/30 Main Classification: G06F17/30
Agency: Lindauer Law, PLLC Agent: Lindauer Law, PLLC
Principal Claim: 1. A computer-implemented method, comprising: accessing dataset association metadata describing associations among tables to be stored on at least one of a plurality of data storage systems, wherein an association between two tables corresponds to a join statement between the two tables, the join statement included in a job to be executed on the at least one of the plurality of data storage systems; determining a graph that describes the associations among the tables, the graph comprising: vertices corresponding to individual ones of the tables; and edges connecting pairs of the vertices, wherein an edge corresponds to the association between the two tables; traversing the graph to determine a set of vertices that are in at least one associative tree; determining an amount of storage space to be used by a set of tables corresponding to the set of vertices; determining a data storage system characterized by an available storage capacity that is at least the amount of storage space to be used by the set of tables, the data storage system included in the plurality of data storage systems; and storing the set of tables on the data storage system.
Address: Seattle WA US
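
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes the join associations have already been extracted as pairs of table names, uses a plain breadth-first traversal to group connected tables (rather than the greedy traversal named in the abstract), and picks a storage system by simple first-fit on free capacity (rather than the utilization metric). The function names `build_graph`, `associated_sets`, and `place` are hypothetical and do not come from the patent.

```python
from collections import deque


def build_graph(tables, joins):
    """Vertices are tables; an undirected edge links two joined tables."""
    graph = {t: set() for t in tables}
    for a, b in joins:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def associated_sets(graph):
    """Group tables reachable from one another, using breadth-first traversal."""
    seen = set()
    groups = []
    for start in sorted(graph):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        group = []
        while queue:
            vertex = queue.popleft()
            group.append(vertex)
            for neighbor in sorted(graph[vertex]):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        groups.append(group)
    return groups


def place(groups, sizes, capacity):
    """Assign each group to the first system with enough free space.

    Assumption: first-fit, largest group first. The patent's utilization
    metric and processing-capacity check are not modeled here.
    """
    free = dict(capacity)
    placement = {}

    def group_size(group):
        return sum(sizes[t] for t in group)

    for group in sorted(groups, key=group_size, reverse=True):
        need = group_size(group)
        target = next((s for s in sorted(free) if free[s] >= need), None)
        if target is None:
            raise ValueError(f"no storage system can hold {group}")
        free[target] -= need
        placement[tuple(group)] = target
    return placement
```

Under these assumptions, each group of tables that are joined together is stored on a single system, so a job that joins them does not have to read across storage systems.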