Title of invention: Grouping failure events with adaptive polling and sliding window buffering
Abstract: Embodiments detect and group multiple failure events to enable batch processing of those failure events, such as in a virtual datacenter executing a plurality of virtual machines (VMs). A long timer, an adaptive short timer, and an adaptive polling frequency enable a computing device to efficiently detect and group failure events that may be related (e.g., resulting from one failure). The grouped failure events are processed in parallel, thereby reducing the time for recovery from the failure events.
Publication number: US9507685(B2) Publication date: 2016.11.29
Application number: US201313856167 Filing date: 2013.04.03
Applicant: VMware, Inc. Inventors: Gondi Anjaneya Prasad; Kalluri Hemanth; Kalaskar Naveen Kumar
Classification: G06F11/00; G06F11/30; G06F11/07 Main classification: G06F11/00
Agency: Agent:
Main claim: 1. A system for failure event detection and grouping using adaptive polling intervals and sliding window buffering, said system comprising: one or more memory areas associated with or accessible by a computing device storing a plurality of virtual machines (VMs) and in communication with one or more associated datastores, the memory areas including a value for a short timer, and a value for a long timer; and a processor programmed to: upon detection of a failure event affecting at least one of the plurality of VMs or associated datastores, initiate the short timer and the long timer and poll for additional failure events during each of a series of polling intervals, wherein the series of polling intervals continue until either the short timer or the long timer expires, wherein a duration of each subsequent polling interval of the series depends on whether an additional failure was detected during a respective preceding polling interval of the series, the polling during each polling interval of the series of polling intervals comprising: upon detection of at least one of the additional failure events during a particular polling interval, collecting data relating to the detected at least one additional failure event, resetting the short timer, and reducing a duration of a next polling interval relative to the particular polling interval; and upon no detection of at least one of the additional failure events during a particular polling interval, increasing a duration of a next polling interval, relative to the particular polling interval; group the detected failure event with the detected at least one additional failure event into a group of failure events; and perform recovery operations in parallel for each failure event in the group of failure events.
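The polling loop recited in the main claim can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names (group_failure_events, recover_in_parallel), the injected poll, now, sleep, and recover callables, and all default timer values are hypothetical. The claim only requires that the next polling interval be reduced after a detection and increased otherwise; the halving and doubling used here, and the minimum and maximum bounds, are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def group_failure_events(first_event, poll, now, sleep,
                         short_timeout=4.0, long_timeout=30.0,
                         interval=1.0, min_interval=0.25, max_interval=2.0):
    """Group failure events that arrive close together (illustrative only).

    poll()  -> list of failure events seen since the previous call (hypothetical)
    now()   -> current time in seconds; sleep(d) waits d seconds
    """
    group = [first_event]
    start = now()
    short_deadline = start + short_timeout  # reset whenever a new event arrives
    long_deadline = start + long_timeout    # hard upper bound, never reset

    while True:
        # Stop as soon as either the short timer or the long timer expires.
        remaining = min(short_deadline, long_deadline) - now()
        if remaining <= 0:
            break
        sleep(min(interval, remaining))
        found = poll()
        if found:
            group.extend(found)                         # collect event data
            short_deadline = now() + short_timeout      # reset the short timer
            interval = max(min_interval, interval / 2)  # assumed: halve interval
        else:
            interval = min(max_interval, interval * 2)  # assumed: double interval
    return group


def recover_in_parallel(group, recover):
    """Run the hypothetical recover(event) callable for all events concurrently."""
    with ThreadPoolExecutor(max_workers=len(group)) as pool:
        return list(pool.map(recover, group))
```

In a live setting, time.monotonic and time.sleep could be passed as now and sleep; injecting them keeps the sketch deterministic under a simulated clock.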
Address: Palo Alto, CA, US