Abstract |
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for flow analysis. In one aspect, a method includes modifying a dataflow graph, the dataflow graph including a plurality of paths connecting at least one entry point and at least one exit point, including adding components to the dataflow graph that add flow units to data records and remove flow units from data records, each flow unit identifying a segment of a path traversed by the data record. The method also includes identifying execution paths based on flow units obtained by processing a plurality of data records using the modified dataflow graph. The method also includes determining a subset of the plurality of data records, wherein a selected set of execution paths is represented by the subset. |
Independent claim |
1. A computer-implemented method including:
modifying a dataflow graph, the dataflow graph including a plurality of paths connecting at least one entry point and at least one exit point, including:
adding components to the dataflow graph that add flow units to data records and remove flow units from data records, each flow unit tagging a specified data record with information identifying (i) a segment of a path through the dataflow graph traversed by the specified data record, and (ii) one or more other data records upon which the specified data record depends, when the specified data record is dependent on one or more other data records;
for a data record processed using the modified dataflow graph, generating, based on one or more flow units tagging the data record, a record lineage that specifies (i) which one of the plurality of paths of the dataflow graph is traversed by the data record, and (ii) one or more other data records upon which the processed data record depends, when the data record is dependent on one or more other data records;
based on record lineages generated, identifying execution paths of the data records through the modified dataflow graph including the plurality of paths connecting the at least one entry point and the at least one exit point, wherein a first one of the execution paths through the modified dataflow graph traversed by a first one of the data records is distinct from a second one of the execution paths through the modified dataflow graph traversed by a second one of the data records; and
based on a selected set of the execution paths through the modified dataflow graph including the plurality of paths connecting the at least one entry point and the at least one exit point, determining a subset of the plurality of data records having traversed that selected set of the execution paths. |
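
The steps recited in the claim can be illustrated with a minimal Python sketch. It is not the claimed implementation: the toy graph (one entry, an even/odd branch, one exit), the segment names, the rule that an odd record depends on the record before it, and every identifier (`FlowUnit`, `Record`, `tag`, `strip_flow_units`, `run_modified_graph`, `execution_paths`, `records_for_paths`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FlowUnit:
    """Hypothetical flow unit: one path segment plus optional dependencies."""
    segment: str
    depends_on: tuple = ()


@dataclass
class Record:
    record_id: int
    value: int
    flow_units: list = field(default_factory=list)


def tag(record, segment, depends_on=()):
    """Added component: attach a flow unit to the record."""
    record.flow_units.append(FlowUnit(segment, tuple(depends_on)))


def strip_flow_units(record):
    """Added component at the exit: remove flow units, return the lineage."""
    path = tuple(u.segment for u in record.flow_units)
    deps = tuple(sorted({d for u in record.flow_units for d in u.depends_on}))
    record.flow_units = []
    return path, deps


def run_modified_graph(records):
    """Toy graph (an assumption): entry -> filter -> even or odd branch -> exit."""
    lineages = {}
    previous_id = None
    for r in records:
        tag(r, "entry->filter")
        if r.value % 2 == 0:
            tag(r, "filter->even")
            tag(r, "even->exit")
        else:
            # Assumed dependency rule: an odd record depends on its predecessor.
            deps = (previous_id,) if previous_id is not None else ()
            tag(r, "filter->odd", deps)
            tag(r, "odd->exit")
        lineages[r.record_id] = strip_flow_units(r)
        previous_id = r.record_id
    return lineages


def execution_paths(lineages):
    """Group record ids by the execution path recorded in their lineage."""
    by_path = defaultdict(list)
    for record_id, (path, _deps) in lineages.items():
        by_path[path].append(record_id)
    return dict(by_path)


def records_for_paths(by_path, selected_paths):
    """Subset of records that traversed any of the selected execution paths."""
    return sorted(rid for p in selected_paths for rid in by_path.get(p, []))


records = [Record(1, 10), Record(2, 7), Record(3, 4)]
lineages = run_modified_graph(records)
by_path = execution_paths(lineages)
odd_path = ("entry->filter", "filter->odd", "odd->exit")
subset = records_for_paths(by_path, [odd_path])
```

In this sketch, records 1 and 3 share the even path, record 2 takes the odd path and carries a dependency on record 1, and selecting the odd path yields the subset containing only record 2. Because the flow units are stripped at the exit, the records leave the graph in their original form.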