International Journal of Applied Science - Research and Review Open Access

  • ISSN: 2394-9988
  • Journal h-index: 10
  • Journal CiteScore: 2.27
  • Journal Impact Factor: 1.33
  • Average acceptance to publication time (5-7 days)
  • Average article processing time (30-45 days) Less than 5 volumes 30 days
    8 - 9 volumes 40 days
    10 and more volumes 45 days

Short Communication - (2023) Volume 10, Issue 1

Multi-Modal Fusion Structure for Dense Video Captioning
Chuan Zhang*
1Department of Information Technology, University of Sydney, Australia
*Correspondence: Chuan Zhang, Department of Information Technology, University of Sydney, Australia, Email:

Received: 02-Jan-2023, Manuscript No. ipias-23-15756; Editor assigned: 04-Jan-2023, Pre QC No. ipias-23-15756 (PQ); Reviewed: 18-Jan-2023, QC No. ipias-23-15756; Revised: 23-Jan-2023, Manuscript No. ipias-23-15756 (R); Published: 30-Jan-2023, DOI: 10.36648/2394-9988-10.1.05


Dense video subtitling is a kind of machine interpretation that attempts to isolate occasions from the whole video and limit them. Outlines with a great deal of data and copies dazzle human watchers who parse the video, yet video much of the time presents irrelevant edges in the subject. Nonetheless, these specifics have been to a great extent overlooked in past works. We propose an upgraded visual multi-modular combination structure that utilizes the subtitles of video key frames to further develop dense video subtitling execution to integrate human visual insight into the most common way of understanding video completely. We first concentrate video key frames through time stamps and subsequently apply the actually proposed picture captioning procedure DLCT to get a fleetingly changed engraving of the keyframe. Utilizing transformer engineering, Evmff joins discourse text, subtitles from keyframe pictures, video elements, and sound highlights to create text depictions. Using the ActivityNet Subtitles dataset and four particular pointers is utilized to approve our model’s presentation. Removal tests show that involving the video keyframe depiction as the contribution to the multi-modular model compensates for an absence of visual data understanding. Our code will be made accessible.


Regular language handling and PC vision are joined in video subtitling. It regularly comprises of two stages: The initial step is to fathom the data in the video by distinguishing the visual elements and their activities and looking at how they connect with each other; in the subsequent stage, a linguistically right normal language depiction and a semantic match are made to make sense of the video’s substance. The perception of the whole video is summarized in a solitary sentence with customary video subtitling. In any case, this is essentially relevant to accounts with a single scene. Utilizing a solitary sentence to depict a video all in all might miss significant subtleties because of the way that genuine recordings commonly contain various occasions and scenes with unessential foundation data. In view of this defect, it is difficult to write in a way that is both familiar and relevant to sum up the principal thought of the video completely. Dense video subtitling is utilized to address the previously mentioned issues by partitioning the scene and occasions into unmistakable sections and making graphic sentences for each fragment. Unimodal models have been the focal point of a ton of past examination on dense video subtitling. Many signs are missed on the grounds that video contains a great deal of acoustic and language data. The model has been attempted to consolidate extra signals, including discourse and sound information, and their importance has been underlined in ensuing examination. Video, as opposed to in any case pictures, has more visual substance. The cut video will in any case contain dissimilar substance, occasions, and activities even after division. The cut video has such a large number of scenes that aren’t pertinent to the principal content due to this peculiarity, which makes it difficult for the model to grasp the video. Thus, the test of multimodal dense video subtitling is to create a video portrayal that sums up the real satisfied of the video considering changing and dynamic visual data. Extricating semantic subtleties from video content is troublesome in light of the fact that it needs reliable and definite semantic subtleties [1-4].


Thus, to make it more straightforward for PCs to understand the visual substance, it is crucial for track down ways of researching the video’s applicable key semantic data. As advantageous semantic data, they endeavored to find video-like picture information base titles. Notwithstanding creating positive outcomes, this technique depended too vigorously on the picture data set, bringing about an absence of variety in the information. Multimodal input is investigated in Tvr, VX2TEXT, and Legend. VX2TEXT explicitly examines the utilization of sack of-words as video portrayals by using language inserting models and prepared influence methodology explicit classifiers to foresee classification marks from predefined language vocabularies.



Conflict of Interest

The author declares there is no conflict of interest in publishing this article.


Citation: Zhang C (2023) Multi-Modal Fusion Structure for Dense Video Captioning. Int J Appl Sci Res Rev. 10:05.

Copyright: © 2023 Zhang C. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.