Title: Using speaker clustering to switch between different camera views in a video conference system
Abstract: A video conference endpoint includes one or more cameras to capture video of different views and a microphone array to sense audio. One or more closeup views are defined. The endpoint detects faces in the captured video and active audio sources from the sensed audio. The endpoint detects any active talker whose detected face position coincides with a detected active audio source, and also uses speaker clustering to detect whether any active talker is associated with a previously stored closeup view. Based on whether an active talker is detected in any of the stored closeup views, the endpoint switches between capturing video of one of the closeup views and a best overview of the participants in the conference room.
Publication number: US9633270(B1); Publication date: 2017.04.25
Application number: US201615091056; Filing date: 2016.04.05
Applicant: Cisco Technology, Inc.; Inventors: Tangeland Kristian;Hellerud Erik;Birkenes Oystein
Classification: H04N7/15;G06K9/00;H04N5/232;G10L17/00;G10L17/10; Main classification: H04N7/15
Agency: Edell, Shapiro & Finnan, LLC; Agent: Edell, Shapiro & Finnan, LLC
Main claim: 1. A method comprising: at a video conference endpoint including a microphone and a camera: detecting a talker position of a talker based on audio detected by the microphone; detecting one or more faces and face positions in video captured by the camera; determining whether a detected talker position matches any detected face position; if there is no match, performing speaker clustering across a speech segment in the detected audio and speech segments in previously detected audio; if the speaker clustering indicates the talker is known, determining whether the detected talker position matches a previous closeup position of the talker, wherein the previous closeup position is a previously detected talker position that matches a previously detected face; and based on results of the determining whether the detected talker position matches the previous closeup position of the talker, framing either a closeup camera view on the previous closeup position or a non-closeup camera view based on the detected talker position.
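Illustrative sketch (not part of the patent record): the decision flow recited in claim 1 can be approximated in a few lines of Python. Everything named here is a hypothetical stand-in chosen for illustration: positions are modeled as a single angle in degrees, "speaker clustering" is reduced to a cosine-similarity comparison between voice embeddings with an arbitrary threshold, and the names `KnownTalker`, `choose_framing`, `sim_threshold`, and `tol` do not come from the source. The behavior on a direct face match (closeup) follows the abstract, and the behavior for an unknown talker (non-closeup view) is an assumption, since claim 1 does not recite either case.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class KnownTalker:
    embedding: List[float]    # voice embedding from earlier speech (assumed representation)
    closeup_position: float   # earlier talker position that matched a detected face


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_known(embedding: List[float], known: List[KnownTalker],
               sim_threshold: float) -> Optional[KnownTalker]:
    """Stand-in for speaker clustering: best earlier talker above the threshold."""
    best = max(known, key=lambda k: cosine(embedding, k.embedding), default=None)
    if best is not None and cosine(embedding, best.embedding) >= sim_threshold:
        return best
    return None


def choose_framing(talker_pos: float, face_positions: List[float],
                   segment_embedding: List[float], known: List[KnownTalker],
                   sim_threshold: float = 0.8, tol: float = 5.0) -> Tuple[str, float]:
    """Return (view, position), where view is "closeup" or "overview"."""
    # Does the audio-derived talker position match any detected face position?
    for face in face_positions:
        if abs(talker_pos - face) <= tol:
            # Remember this as the talker's closeup position for later segments.
            talker = find_known(segment_embedding, known, sim_threshold)
            if talker is None:
                known.append(KnownTalker(segment_embedding, face))
            else:
                talker.closeup_position = face
            return ("closeup", face)  # assumption based on the abstract

    # No face match: compare the speech segment against earlier segments.
    talker = find_known(segment_embedding, known, sim_threshold)
    if talker is not None and abs(talker_pos - talker.closeup_position) <= tol:
        # Known talker at the previous closeup position: reuse that closeup.
        return ("closeup", talker.closeup_position)

    # Known talker who moved, or unknown talker (assumed): non-closeup view.
    return ("overview", talker_pos)
```

In this sketch, a talker whose face is momentarily undetected (for example, turned away from the camera) still gets the closeup view as long as the voice matches an earlier segment and the audio position agrees with the stored closeup position.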
Address: San Jose CA US