Title of invention: Systems and methods for labeling source data using confidence labels
Abstract: Systems and methods for the annotation of source data using confidence labels in accordance with embodiments of the invention are disclosed. In one embodiment of the invention, a method for determining confidence labels for crowdsourced annotations includes obtaining a set of source data, obtaining a set of training data representative of the set of source data, determining the ground truth for each piece of training data, obtaining a set of training data annotations including a confidence label, measuring annotator accuracy data for at least one piece of training data, and automatically generating a set of confidence labels for the set of unlabeled data based on the measured annotator accuracy data and the set of annotator labels used.
Publication number: US9355359(B2) Publication date: 2016.05.31
Application number: US201313915962 Filing date: 2013.06.12
Applicant: California Institute of Technology Inventors: Welinder Peter; Perona Pietro
Classification: G06N5/04; G06F17/30 Main classification: G06N5/04
Agency: KPPB LLP Agent: KPPB LLP
Principal claim: 1. A method for determining labels for crowdsourced annotations, comprising: obtaining a set of source data using a distributed data annotation server system comprising a processor and a memory readable by the processor, where the source data comprises a set of unlabeled data; obtaining a set of training data using the distributed data annotation server system, where the set of training data comprises a subset of the source data representative of the set of source data; determining ground truth data describing the ground truth for each piece of training data in the set of training data using the distributed data annotation server system, where the ground truth data for a piece of training data describes the content of the piece of data; generating sets of annotator data based on the set of source data and the set of training data, where a set of annotator data comprises at least one piece of source data selected from the set of source data and at least one piece of training data selected from the set of training data; providing the sets of annotator data to a plurality of data annotation devices, wherein a data annotation device: obtains the set of annotator data; generates a set of annotation data based on the obtained set of annotator data and a set of annotator characteristics describing the data annotation device, wherein the set of annotation data comprises at least one source data annotation applied to a piece of source data and one training data annotation applied to a piece of training data, where a training data annotation comprises data describing the piece of training data and a confidence label selected from a set of confidence labels describing a measure of confidence in the accuracy of the data describing the piece of training data; and transmits the set of annotation data; obtaining the sets of annotation data from the plurality of data annotation devices using the distributed data annotation server system; calculating annotator accuracy data for each data annotation device for at least one piece of training data in the set of training data based on the ground truth for each piece of training source data and the set of training data annotations using the distributed data annotation server system, wherein the annotator accuracy data describes the accuracy of annotation data provided by a particular data annotation device based on the accuracy and confidence indicated for one or more pieces of training data provided to the particular data annotation device; and automatically generating a set of labels for each piece of unlabeled data in the set of source data based on the calculated annotator accuracy data and the set of annotator labels received from the plurality of data annotation devices using the distributed data annotation server system.
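The last two steps of the claim (calculating annotator accuracy from training items with known ground truth, then generating labels for unlabeled items) can be illustrated with a minimal Python sketch. The claim does not specify a formula, so everything here is an assumption made for illustration: the function names `annotator_accuracy` and `generate_labels`, the tuple layout `(annotator, item, label, confidence)`, the additive smoothing, and the log-odds weighted vote are all hypothetical choices, not the patented method.

```python
import math
from collections import defaultdict


def annotator_accuracy(training_annotations, ground_truth, smoothing=1.0):
    """Estimate accuracy per (annotator, confidence label) pair.

    training_annotations: iterable of (annotator, item, label, confidence).
    ground_truth: dict mapping training item -> true label.
    smoothing: additive smoothing (an assumption); keep it > 0 so that
    no estimate is exactly 0 or 1.
    """
    correct = defaultdict(float)
    total = defaultdict(float)
    for annotator, item, label, confidence in training_annotations:
        key = (annotator, confidence)
        total[key] += 1
        if label == ground_truth[item]:
            correct[key] += 1
    return {
        key: (correct[key] + smoothing) / (total[key] + 2 * smoothing)
        for key in total
    }


def generate_labels(source_annotations, accuracy, default_accuracy=0.5):
    """Pick one label per unlabeled item by a log-odds weighted vote.

    Each annotation votes for its label with weight log(p / (1 - p)),
    where p is the estimated accuracy of that annotator at the stated
    confidence label. Unseen (annotator, confidence) pairs get weight 0.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for annotator, item, label, confidence in source_annotations:
        p = accuracy.get((annotator, confidence), default_accuracy)
        scores[item][label] += math.log(p / (1 - p))
    return {item: max(votes, key=votes.get) for item, votes in scores.items()}


# Illustrative data only; annotator and item names are made up.
truth = {"t1": "bird", "t2": "bird"}
training = [
    ("a", "t1", "bird", "high"),
    ("a", "t2", "bird", "high"),
    ("b", "t1", "cat", "high"),
    ("b", "t2", "bird", "low"),
]
source = [("a", "s1", "bird", "high"), ("b", "s1", "cat", "high")]
labels = generate_labels(source, annotator_accuracy(training, truth))
```

Keying the accuracy estimate on the confidence label as well as the annotator reflects the claim's requirement that accuracy data be based on both "the accuracy and confidence indicated"; a real system would likely use a richer annotator model than this per-bucket frequency estimate.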
Address: Pasadena CA US