Title: Use of incremental checkpoints to restore user data stream processes
Abstract: A method and system for failure recovery in a storage system are disclosed. In the storage system, user data streams (e.g., log data) are collected by a scribeh system. The scribeh system may include a plurality of Calligraphus servers, HDFS, and Zookeeper. The Calligraphus servers may shard the user data streams based on keys (e.g., category and bucket pairs) and stream the user data streams to Puma nodes. Sharded user data streams may be aggregated according to the keys in the memory of a specific Puma node. Periodically, the aggregated user data streams cached in the memory of the specific Puma node are persisted to HBase together with an incremental checkpoint. When a specific process on the specific Puma node fails, Ptail retrieves the incremental checkpoint from HBase and then restores the specific process by requesting the user data streams processed by that process from the scribeh system according to the incremental checkpoint.
Publication number: US9471436(B2) Publication date: 2016.10.18
Application number: US201313868873 Filing date: 2013.04.23
Applicant: Facebook, Inc. Inventors: Rash Samuel; Borthakur Dhrubajyoti; Khemani Prakash; Shao Zheng
Classification: G06F11/14; G06Q50/00 Main classification: G06F11/14
Agency: Perkins Coie LLP Agent: Perkins Coie LLP
Principal claim: 1. A method, comprising: collecting user data streams from a plurality of different sources; wherein the user data streams are collected by a data stream processing system; sharding the user data streams based on keys, the keys including a plurality of categories; wherein each category is subdivided into one or more buckets; streaming the sharded user data streams to an application node; aggregating the sharded user data streams in memory of the application node according to the keys; periodically persisting memory content and a most current incremental checkpoint on the application node to a storage; and in an event a specific process of the application node fails, generating a first process for restoring the specific process, wherein the first process is configured to: retrieve the most current incremental checkpoint from the storage, request user data streams handled by the failed specific process from the data stream processing system according to the most current incremental checkpoint, wherein the requested user data streams are obtained from any of (a) a file storage system based on directory information in the most current incremental checkpoint, (b) one or more of multiple servers based on server identification (ID) in the most current incremental checkpoint or (c) the storage to which in-memory data of the application node is persisted, the in-memory data including user data streams handled by the failed specific process, and restore the specific process on the application node in real time based on the requested user data streams and the most current incremental checkpoint and the aggregated sharded user data streams.
Address: Menlo Park CA US
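
The flow described in the abstract and the principal claim (shard by category and bucket, aggregate in memory, periodically persist the aggregates with a checkpoint, then replay from the checkpoint after a failure) can be illustrated with a small in-memory sketch. This is not the patented implementation. Every name below (`shard_key`, `StreamLog`, `Storage`, `Aggregator`, `NUM_BUCKETS`) is hypothetical, `StreamLog` and `Storage` merely stand in for the data stream processing system and the persistent store, and the checkpoint is assumed to be a per-key read offset, whereas the claim speaks of directory information or server IDs.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical number of buckets per category


def shard_key(category, record_id):
    """Map a record to a (category, bucket) key using a stable hash."""
    bucket = zlib.crc32(record_id.encode("utf-8")) % NUM_BUCKETS
    return (category, bucket)


class StreamLog:
    """Stand-in for the data stream processing system: one append-only log per key."""

    def __init__(self):
        self.logs = defaultdict(list)

    def append(self, key, value):
        self.logs[key].append(value)

    def read_from(self, key, offset):
        # Return every value at or after the given offset.
        return self.logs[key][offset:]


class Storage:
    """Stand-in for the persistent store that holds aggregates and the checkpoint."""

    def __init__(self):
        self.snapshot = {}
        self.checkpoint = {}

    def persist(self, counts, offsets):
        # Aggregates and checkpoint are written together so they stay consistent.
        self.snapshot = dict(counts)
        self.checkpoint = dict(offsets)


class Aggregator:
    """In-memory aggregation on an application node (assumed: a running sum per key)."""

    def __init__(self, storage):
        self.storage = storage
        self.counts = defaultdict(int)
        self.offsets = defaultdict(int)  # assumed checkpoint: next offset to read per key

    def consume(self, key, value):
        self.counts[key] += value
        self.offsets[key] += 1

    def flush(self):
        self.storage.persist(self.counts, self.offsets)

    @classmethod
    def restore(cls, storage, log):
        """Rebuild state from the last snapshot, then replay what came after it."""
        agg = cls(storage)
        agg.counts.update(storage.snapshot)
        agg.offsets.update(storage.checkpoint)
        for key in list(log.logs):
            for value in log.read_from(key, agg.offsets[key]):
                agg.consume(key, value)
        return agg


def demo():
    log, storage = StreamLog(), Storage()
    agg = Aggregator(storage)
    events = [("clicks", "u1", 1), ("clicks", "u2", 1),
              ("views", "u1", 3), ("clicks", "u1", 2)]
    for i, (category, record_id, value) in enumerate(events):
        key = shard_key(category, record_id)
        log.append(key, value)
        agg.consume(key, value)
        if i == 1:
            agg.flush()  # periodic persist; later events are not yet in storage
    expected = dict(agg.counts)
    # Simulate a process failure: the in-memory state of `agg` is discarded.
    restored = Aggregator.restore(storage, log)
    return expected, dict(restored.counts)
```

Calling `demo()` returns the aggregates held just before the simulated failure and the aggregates rebuilt by `Aggregator.restore`. In this sketch the two match because only the events recorded after the last flush are replayed, so no event is counted twice.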