Title of invention: SOURCE CODE SEARCH ENGINE
Abstract: A source code search comprises a two-pass search. The first pass comprises a topological measure of similarity. The second pass comprises a semantic measure of similarity. The query source code is a user-selected portion of source code. The results may be ranked and output to an I/O device.
Publication number: US2017046250(A1)
Publication date: 2017.02.16
Application number: US201615342183
Filing date: 2016.11.03
Applicant: International Business Machines Corporation
Inventors: Fontenot Nathan; Gunter Fionnuala G.; Strosaker Michael T.; Wilson George C.
Classification: G06F11/36; G06F9/44
Main classification: G06F11/36
Patent agency:
Agent:
Independent claim: 1. A computer program product comprising a computer readable storage medium having program instructions embodied therewith, wherein the computer readable storage medium is not a transitory signal per se, the program instructions executable by a processor to cause the processor to perform a method comprising:
creating a respective abstract syntax tree (AST) of each of a user-defined query source code data set and at least one target source code data set, wherein the user-defined query source code data set comprises a selected portion of source code in a given programming language comprising a complete function, wherein the target source code data set comprises at least one file within at least one repository containing source code in the given programming language;
calculating a respective first similarity value for each of one or more portions of each of the at least one target source code data sets, wherein each respective first similarity value comprises a topological measure of similarity between the user-defined query source code data set and each respective portion of the at least one target source code data set, wherein calculating a respective first similarity value further comprises:
  calculating, for the query source code abstract syntax tree, a first number of vertices and edges;
  calculating, for each respective target source code abstract syntax subtree, a respective second number of vertices and edges;
  calculating, for each respective target source code abstract syntax subtree, a respective absolute value of a difference between the first number and the respective second number; and
  comparing, for each respective target source code abstract syntax subtree, the respective absolute value to a first threshold;
identifying portions of each of the at least one target source code data sets having a respective first similarity value less than or equal to the first threshold, wherein the first threshold comprises a permissible difference in the number of vertices, edges, or vertices and edges between the user-defined query source code abstract syntax tree and a respective target source code abstract syntax subtree;
calculating a respective second similarity value for each portion of a target source code data set having a respective first similarity value less than or equal to the first threshold, the respective second similarity value comprising a semantic measure of similarity between the user-defined query source code data set and each respective portion of the target source code data set having a respective first similarity value less than or equal to the first threshold, wherein calculating a respective second similarity value further comprises:
  identifying one or more series of operations to transform a target source code abstract syntax subtree to the query source code abstract syntax tree, wherein said series of operations comprises one or more of insert, delete, and rename operations;
  calculating, for each identified series of operations, a cost of the identified series of operations, wherein the cost of the identified series of operations is associated with one or more of insert, delete, and rename operations; and
  selecting the series of operations having a lowest cost;
outputting, to a user interface, each portion of each target source code data set having a second similarity value less than or equal to a second threshold, wherein each portion is ranked according to the second similarity value.
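The two passes recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation and rests on several assumptions: the given programming language is Python and is parsed with the standard `ast` module; candidate target portions are limited to function definitions; vertex labels are AST node type names, so identifier names are ignored; every insert, delete, and rename has unit cost; and the second pass uses a simplified top-down (Selkow-style) tree edit distance in which insert and delete apply to whole subtrees, swapped in for a general tree edit distance. The names `size`, `edit_distance`, and `search` are hypothetical.

```python
import ast


def node_count(node):
    """Number of vertices in the AST rooted at node."""
    return sum(1 for _ in ast.walk(node))


def size(node):
    """Vertices plus edges; a tree with v vertices has v - 1 edges."""
    v = node_count(node)
    return v + (v - 1)


def edit_distance(target, query):
    """Lowest cost to transform the target subtree into the query tree.

    Assumption: top-down variant. Roots are renamed if their labels
    differ, and child sequences are aligned by dynamic programming in
    which a child subtree is deleted or inserted as a whole.
    """
    rename = 0 if type(target).__name__ == type(query).__name__ else 1
    tc = list(ast.iter_child_nodes(target))
    qc = list(ast.iter_child_nodes(query))
    m, n = len(tc), len(qc)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + node_count(tc[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + node_count(qc[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + node_count(tc[i - 1]),      # delete subtree
                dp[i][j - 1] + node_count(qc[j - 1]),      # insert subtree
                dp[i - 1][j - 1] + edit_distance(tc[i - 1], qc[j - 1]),
            )
    return rename + dp[m][n]


def search(query_src, target_src, first_threshold, second_threshold):
    """Return (cost, function name) pairs ranked by ascending cost."""
    query = ast.parse(query_src).body[0]  # assumed: one complete function
    query_size = size(query)
    hits = []
    for node in ast.walk(ast.parse(target_src)):
        if not isinstance(node, ast.FunctionDef):
            continue
        # First pass: topological filter on vertex and edge counts.
        if abs(query_size - size(node)) > first_threshold:
            continue
        # Second pass: edit cost as the semantic measure.
        cost = edit_distance(node, query)
        if cost <= second_threshold:
            hits.append((cost, node.name))
    return sorted(hits)


QUERY = "def add(a, b):\n    return a + b\n"
TARGET = '''
def plus(x, y):
    return x + y

def minus(x, y):
    return x - y

def big(x):
    for i in range(x):
        print(i)
    return x
'''

print(search(QUERY, TARGET, first_threshold=4, second_threshold=2))
# prints [(0, 'plus'), (1, 'minus')]
```

In this example `big` is discarded by the first pass because its tree is much larger than the query's, so the more expensive edit-distance computation runs only on `plus` and `minus`. `minus` costs one rename (`Sub` to `Add`), while `plus` matches at cost zero because identifier names are not part of the labels in this sketch.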
Address: Armonk NY US