Novelty at the fair: New tool for simultaneous face and speaker recognition enables fast people search in large media archives

September 11, 2023

The new combined face and speaker analysis offers program planners a comprehensive view of when individual people appear in TV broadcasts. For this purpose, the Audiovisual Identity Suite analyzes large volumes of data, for example any number of programs spanning many weeks, within a very short time. The results of the audiovisual recognition of specific persons are presented in an intuitive, easy-to-understand user interface and can be used for in-depth insights, trend analyses, and statistics.

If you want to determine the media presence of a specific individual within a program during a certain period, the tool shows in a so-called heatmap when and how often that person was visible or audible on different TV channels. An important feature of the tool is that it also works reliably when the relevant person is speaking but is not shown in the picture. This is of particular interest in situations such as talk shows, where audience reactions are captured or other panelists are faded in while the person on the podium continues to speak.

This is made possible by combining audio and video analysis methods. The institute has long-standing expertise in both research disciplines, and both analysis methods have already been successfully applied in various products and solutions.
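The idea of combining the two modalities can be illustrated with a minimal sketch. It is not the implementation used in the Audiovisual Identity Suite; it only assumes that face recognition and speaker recognition each deliver time intervals (in seconds) in which a given person was detected. The names `merge_intervals`, `presence`, and `total_seconds`, as well as the sample values, are hypothetical.

```python
from typing import List, Tuple

# Assumed data model: one detection = (start_seconds, end_seconds)
Interval = Tuple[float, float]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Return the union of possibly overlapping intervals, sorted by start."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def presence(face: List[Interval], voice: List[Interval]) -> List[Interval]:
    """Intervals in which the person is visible, audible, or both."""
    return merge_intervals(face + voice)


def total_seconds(intervals: List[Interval]) -> float:
    """Total duration covered by a list of non-overlapping intervals."""
    return sum(end - start for start, end in intervals)


# Hypothetical detections for one person in one program
face_hits = [(10.0, 25.0), (60.0, 70.0)]
voice_hits = [(20.0, 40.0), (65.0, 90.0)]

combined = presence(face_hits, voice_hits)
print(combined)
print(total_seconds(combined))
```

In this sketch, the stretches in which the person is only heard (for example from second 25 to 40) still count toward their presence, which mirrors the talk-show case described above.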

Cross-modal combination of audio and video analysis methods
For the first time, the Audiovisual Identity Suite combines both methods into a cross-modal analysis tool. "This increases the validity and quality of the results significantly," explains Dr. Uwe Kühhirt, expert for video analysis at Fraunhofer IDMT and co-developer of the Audiovisual Identity Suite. To identify people acoustically in programs, the institute relies on AI-based algorithms for recognizing speakers and classifying perceived gender. In addition, speech quality analysis makes it possible to evaluate entire programs or individual parts of programs with regard to their acoustic intelligibility.

Intelligent face recognition is used for the visual recognition of people in videos. In this process, facial features such as the visually perceived gender are extracted from the video data. In combination with the previously mentioned acoustic classification of perceived gender, very reliable statements can be made about how often men and women are seen or heard in the program. These findings can help, for example, in planning more gender-appropriate programming and in reporting.

Identity Suite
Analyses and studies with the Audiovisual Identity Suite are initially carried out by Fraunhofer IDMT on behalf of the customer. The results of the analyses are then made available to the customer in a customized user interface, prepared for their specific purposes.
In the future, the analysis tool is also to be made available under license for use at the customer's site.

Upcoming enhancements
The Audiovisual Identity Suite is set for further expansion. Upcoming features include age estimation based on visual analysis and audio enhancements such as language recognition, speech-to-text conversion, and keyword analytics.
"Our planned enhancements will provide deeper opportunities for analysis. With the addition of text transcription, we can not only determine how often certain people appear but also which topics they are talking about," explains Christian Rollwage, expert for speaker recognition at the Fraunhofer IDMT Oldenburg Branch for Hearing, Speech and Audio Technology HSA.
Discover how the Audiovisual Identity Suite can simplify your daily work. Visit us from September 15 to 18, 2023, at the IBC trade show in Hall 8 at the Fraunhofer-Gesellschaft booth B.80 and let our experts show you the advantages of the new cross-modal analysis tool Audiovisual Identity Suite.