Title: Source code search engine
Abstract: A source code search comprises a two-pass search. The first pass comprises a topological measure of similarity. The second pass comprises a semantic measure of similarity. The query source code is a user-selected portion of source code. The results may be ranked and output to an I/O device.
Publication number: US9378242(B1)    Publication date: 2016.06.28
Application number: US201514974342    Filing date: 2015.12.18
Applicant: International Business Machines Corporation    Inventors: Fontenot Nathan; Gunter Fionnuala G.; Strosaker Michael T.; Wilson George C.
Classification: G06F17/30    Main classification: G06F17/30
Agency:    Agents: Sabo Stosch; Bennett Steven L.
Main claim: 1. A computer-implemented method of identifying similar source code components comprising:
creating a respective abstract syntax tree (AST) of each of a user-defined query source code data set and at least one target source code data set, wherein the user-defined query source code data set comprises a selected portion of source code in a given programming language comprising a complete function, wherein the target source code data set comprises at least one file within at least one repository containing source code in the given programming language;
calculating a respective first similarity value for each of one or more portions of each of the at least one target source code data sets, wherein each respective first similarity value comprises a topological measure of similarity between the user-defined query source code data set and each respective portion of the at least one target source code data set, wherein calculating the respective first similarity value further comprises:
calculating, for the query source code abstract syntax tree, a first number of vertices and edges;
calculating, for each respective target source code abstract syntax subtree, a respective second number of vertices and edges;
calculating, for each respective target source code abstract syntax subtree a respective absolute value of a difference between the first number and the respective second number; and
comparing, for each respective target source code abstract syntax subtree, the respective absolute value to a first threshold;
identifying portions of each of the at least one target source code data sets having the respective first similarity value less than or equal to the first threshold, wherein the first threshold comprises a permissible difference in the number of vertices, edges, or vertices and edges between the user-defined query source code abstract syntax tree and the respective target source code abstract syntax subtree;
calculating a respective second similarity value for each portion of the target source code data set having the respective first similarity value less than or equal to the first threshold, the respective second similarity value comprising a semantic measure of similarity between the user-defined query source code data set and each respective portion of the target source code data set having the respective first similarity value less than or equal to the first threshold, wherein calculating the respective second similarity value further comprises:
identifying one or more series of operations to transform the target source code abstract syntax subtree to the query source code abstract syntax tree, wherein said series of operations comprises one or more of insert, delete, and rename operations;
calculating, for each identified series of operations, a cost of the identified series of operations, wherein the cost of the identified series of operations is associated with one or more of insert, delete, and rename operations; wherein the cost of the identified series of operations is the respective second similarity value; and
selecting the series of operations having a lowest cost;
outputting, to a user interface, each portion of each target source code data set having the second similarity value less than or equal to a second threshold, wherein each portion is ranked according to the second similarity value.
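The two passes recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and several choices are assumptions made only for illustration: Python's standard `ast` module stands in for the AST builder, every function definition in the target text is treated as one candidate subtree, a vertex is labeled by its AST node type alone (so identifier names are ignored), each insert, delete, or rename costs 1, and the lowest-cost series of operations is found with a plain memoized ordered-tree edit distance. The names `as_tree`, `count_vertices`, `vertices_and_edges`, `forest_distance`, `functions`, and `search` are hypothetical.

```python
import ast
from functools import lru_cache


def as_tree(node):
    """Convert an ast node into a hashable (label, children) tuple."""
    children = tuple(as_tree(child) for child in ast.iter_child_nodes(node))
    return (type(node).__name__, children)


def count_vertices(tree):
    """Number of vertices in a (label, children) tree."""
    return 1 + sum(count_vertices(child) for child in tree[1])


def vertices_and_edges(tree):
    """Vertices plus edges; a tree with n vertices has n - 1 edges."""
    n = count_vertices(tree)
    return n + (n - 1)


@lru_cache(maxsize=None)
def forest_distance(source, target):
    """Lowest unit cost of insert/delete/rename edits turning source into target.

    Both arguments are ordered forests, i.e. tuples of (label, children) trees.
    """
    if not source and not target:
        return 0
    if not source:
        return sum(count_vertices(t) for t in target)  # insert every vertex
    if not target:
        return sum(count_vertices(t) for t in source)  # delete every vertex
    (s_label, s_children), (t_label, t_children) = source[-1], target[-1]
    # Delete the rightmost source root; its children join the forest.
    delete = forest_distance(source[:-1] + s_children, target) + 1
    # Insert the rightmost target root.
    insert = forest_distance(source, target[:-1] + t_children) + 1
    # Map the two rightmost roots onto each other, renaming if labels differ.
    rename = (forest_distance(s_children, t_children)
              + forest_distance(source[:-1], target[:-1])
              + (s_label != t_label))
    return min(delete, insert, rename)


def functions(source_code):
    """Yield (name, tree) for every function definition in the source text."""
    for node in ast.walk(ast.parse(source_code)):
        if isinstance(node, ast.FunctionDef):
            yield node.name, as_tree(node)


def search(query_code, target_code, first_threshold, second_threshold):
    """Return (cost, function name) pairs ranked by ascending edit cost."""
    _, query = next(functions(query_code))
    query_size = vertices_and_edges(query)
    ranked = []
    for name, candidate in functions(target_code):
        # Pass 1 (topological): compare vertex-plus-edge counts only.
        if abs(query_size - vertices_and_edges(candidate)) > first_threshold:
            continue
        # Pass 2 (semantic): cost of transforming the candidate into the query.
        cost = forest_distance((candidate,), (query,))
        if cost <= second_threshold:
            ranked.append((cost, name))
    return sorted(ranked)


QUERY = "def add(a, b):\n    return a + b\n"
TARGET = (
    "def sub(x, y):\n    return x - y\n\n"
    "def plus(p, q):\n    return p + q\n\n"
    "def big(x):\n    total = 0\n    for i in range(x):\n"
    "        total += i\n    return total\n"
)
print(search(QUERY, TARGET, first_threshold=4, second_threshold=3))
```

In this example `big` is discarded by the cheap size comparison before any edit distance is computed, which is the purpose of the first pass. The memoized recursion is only practical for small functions; a production system would likely use a more efficient tree edit distance algorithm.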
Address: Armonk NY US