Title of invention: Trend data clustering
Abstract: In various embodiments, systems, methods, and techniques are disclosed for generating a collection of clusters of related data from a seed. Seeds may be generated based on seed generation strategies or rules. Clusters may be generated by, for example, retrieving a seed, adding the seed to a first cluster, retrieving a clustering strategy or rules, and adding related data and/or data entities to the cluster based on the clustering strategy. Various cluster scores may be generated based on attributes of data in a given cluster. Further, cluster metascores may be generated based on various cluster scores associated with a cluster. Clusters may be ranked based on cluster metascores. Various embodiments may enable an analyst to discover various insights related to data clusters, and may be applicable to various tasks including, for example, tax fraud detection, beaconing malware detection, malware user-agent detection, and/or activity trend detection, among various others.
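The flow the abstract describes (start from a seed, add related data according to a clustering strategy, score the cluster, then rank clusters by a metascore) can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the `links` adjacency map stands in for a clustering strategy, the weighted sum is an assumed way of combining cluster scores into a metascore, and all function and field names are hypothetical.

```python
from collections import deque


def build_cluster(seed, links):
    """Expand a cluster breadth-first from a seed.

    `links` is a hypothetical stand-in for a clustering strategy: it maps
    each data item to the items considered related to it.
    """
    cluster = [seed]          # the seed is the first member of the cluster
    seen = {seed}
    queue = deque([seed])
    while queue:
        item = queue.popleft()
        for related in links.get(item, ()):
            if related not in seen:
                seen.add(related)
                cluster.append(related)
                queue.append(related)  # items related to new members are followed too
    return cluster


def metascore(scores, weights):
    """Combine several cluster scores into one number (assumed: weighted sum)."""
    return sum(weights[name] * value for name, value in scores.items())


def rank_clusters(scored_clusters, weights):
    """Sort (cluster_id, scores) pairs from highest to lowest metascore."""
    return sorted(
        scored_clusters,
        key=lambda pair: metascore(pair[1], weights),
        reverse=True,
    )


# Hypothetical example data
links = {
    "event1": ["hostA"],
    "hostA": ["userX", "10.0.0.5"],
    "userX": ["hostA"],
}
cluster = build_cluster("event1", links)

weights = {"size": 0.1, "severity": 1.0}
ranking = rank_clusters(
    [("c1", {"size": 4, "severity": 0.2}), ("c2", {"size": 2, "severity": 0.9})],
    weights,
)
```

In this sketch the ranking places `c2` ahead of `c1` because severity carries more weight than size. A real system would choose scores and weights to suit the analysis task.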
Publication number: US9177344(B1); Publication date: 2015.11.03
Application number: US201314139640; Filing date: 2013.12.23
Applicant: Palantir Technologies Inc.; Inventors: Singh Harkirat; Weickert Brendan; Sprague Matthew; Kross Michael; Borochoff Adam; Menon Parvathy; Harris Michael
Classification: G06Q40/00; G06Q40/02; Main classification: G06Q40/00
Agency: Knobbe, Martens, Olson & Bear, LLP; Agent: Knobbe, Martens, Olson & Bear, LLP
Principal claim: 1. A computer system to assist a human analyst in analyzing large amounts of trend data of computing devices, the computer system comprising:
  one or more computer readable storage devices configured to store:
    one or more software modules including computer executable instructions, the one or more software modules including a cluster engine module and a workflow engine module; and
    a clustering strategy;
  one or more cluster data sources configured to store:
    a plurality of host-based events associated with one or more computing devices;
    a plurality of activity trend-related data items and properties associated with respective activity trend-related data items, each of the properties including associated property values, the activity trend-related data items including at least one of: data items associated with captured host-based events, Internet Protocol addresses, external domains, users, or computerizing devices, wherein hosts comprise computerizing devices in a network; and
  one or more hardware computer processors in communication with the one or more computer readable storage devices and the one or more cluster data sources, and configured to execute the one or more software modules in order to cause the one or more hardware computer processors to:
    designate, by the cluster engine module, one or more seeds by:
      accessing, from the one or more cluster data sources, the plurality of host-based events;
      determining a first group of the plurality of host-based events each indicating a same particular activity type and associated with a particular host and a reference time period;
      determining, based at least on the first group of host-based events, a first statistical deviation in the same particular activity type of host-based events on the particular host for the reference time period;
      determining a second group of the plurality of host-based events each indicating the same particular activity type and associated with the particular host and a test time period;
      determining, based at least on the second group of host-based events, a second statistical deviation in the same particular activity type of host-based events on the particular host for the test time period; and
      in response to determining that the first statistical deviation compared to the second statistical deviation satisfies a particular threshold, designating a host-based event from the second group as a seed;
    for each designated host-based event seed:
      identify, by the cluster engine module, one or more activity trend-related data items determined to be associated with the designated host-based event seed based at least on the clustering strategy, wherein the clustering strategy queries the one or more cluster data sources to determine at least one of: the particular host associated with the designated host-based event seed, one or more host-based events associated with the particular host, one or more host-based events associated with the designated host-based event seed, users of the particular host, data items associated with the particular host, other hosts associated with the same particular activity type of host-based events, Internet Protocol addresses associated with the particular host, external domains associated with the designated host-based event seed, computing devices associated with the particular host;
      generate, by the cluster engine module, a data item cluster based at least on the designated host-based event seed, wherein generating the data item cluster comprises:
        adding the designated host-based event seed to the data item cluster;
        adding the identified one or more activity trend-related data items to the data item cluster;
        identifying an additional one or more activity trend-related data items associated with any data item of the data item cluster;
        adding the additional one or more activity trend-related data items to the item data cluster; and
        storing the generated data item cluster in the one or more computer readable storage devices; and
      determine, by the cluster engine module, a score for the generated data item cluster; and
    cause presentation, by the workflow engine module, of at least one generated data item cluster and the determined score for the at least one generated data item cluster in a user interface of a client computing device.
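The seed-designation steps of the claim (compare a statistical deviation of one activity type on one host over a reference period against the same measure over a test period, and designate a test-period event as a seed when a threshold is met) might be sketched as below. The claim does not say how the deviation is computed or compared, so this sketch makes several assumptions: events are bucketed into fixed intervals, the deviation is the population standard deviation of per-bucket counts, the threshold is a ratio between the two deviations, and the most recent test-period event is the one designated. All names and the event record layout are hypothetical.

```python
from statistics import pstdev


def counts_per_bucket(times, start, end, bucket):
    """Count events in consecutive fixed-size buckets covering [start, end)."""
    n = max(1, (end - start) // bucket)
    counts = [0] * n
    for t in times:
        if start <= t < end:
            index = min((t - start) // bucket, n - 1)  # fold any partial tail bucket
            counts[index] += 1
    return counts


def designate_seed(events, host, activity, ref_window, test_window,
                   bucket=3600, ratio_threshold=2.0):
    """Return a test-period event to use as a seed, or None.

    Assumption: the threshold is satisfied when the test-period deviation
    exceeds `ratio_threshold` times the reference-period deviation.
    """
    times = [e["time"] for e in events
             if e["host"] == host and e["activity"] == activity]

    # First and second statistical deviations (assumed: stdev of bucket counts)
    ref_dev = pstdev(counts_per_bucket(times, *ref_window, bucket))
    test_dev = pstdev(counts_per_bucket(times, *test_window, bucket))

    test_events = [e for e in events
                   if e["host"] == host and e["activity"] == activity
                   and test_window[0] <= e["time"] < test_window[1]]
    if test_events and test_dev > ratio_threshold * ref_dev:
        # Assumption: designate the most recent test-period event
        return max(test_events, key=lambda e: e["time"])
    return None


# Hypothetical data: times are integer seconds; each window spans four hours.
HOUR = 3600
events = (
    # hostA: one login per hour in the reference period, then a burst
    [{"host": "hostA", "activity": "login", "time": h * HOUR} for h in range(4)]
    + [{"host": "hostA", "activity": "login", "time": 4 * HOUR + i} for i in range(5)]
    # hostB: one login per hour throughout
    + [{"host": "hostB", "activity": "login", "time": h * HOUR} for h in range(8)]
)
ref_window, test_window = (0, 4 * HOUR), (4 * HOUR, 8 * HOUR)

seed_a = designate_seed(events, "hostA", "login", ref_window, test_window)
seed_b = designate_seed(events, "hostB", "login", ref_window, test_window)
```

Here `hostA` yields a seed because its test-period activity is bursty relative to its steady reference period, while `hostB` yields none. A designated seed would then be passed to the cluster generation steps recited later in the claim.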
Address: Palo Alto CA US