| Abstract: |
In recent years, remote-video transmission technologies have been widely adopted as part of social infrastructure, including remote operation of drones and unmanned exploration vehicles as well as remote meetings. In transmission systems that require real-time performance, latency is as critical as video quality. In remote vehicle operation, for example, delays in video transmission slow situational awareness and can lead to serious accidents. Addressing this problem requires techniques that compensate for video latency, namely, techniques for predicting future frames. This study investigates a future video generation method for vehicle-mounted camera video.
Given recent increases in image resolution, generating high-resolution images in real time for such prediction tasks is computationally impractical. This study therefore employs a Variational Auto-Encoder (VAE) to encode images into a low-dimensional latent space, estimates features representing future frames within that space, and decodes the latent features back into image space to generate future images. In addition, semantic segmentation, depth estimation, and optical flow estimation are applied to the input images to provide auxiliary information for future image generation, and this information is integrated with the RGB images using a Vision Transformer (ViT). Specifically, the proposed method comprises an Auxiliary Information Extraction ViT and a Future Video Generation ViT. In the Auxiliary Information Extraction ViT, the depth, segmentation, and optical flow data are tokenized, and the concatenated tokens are processed by a self-attention structure. The resulting tokens are then split by modality and passed through separate MLPs; the outputs are concatenated again, and the operation is repeated. This design analyzes each modality independently while exploiting inter-modal relationships, yielding auxiliary tokens. The Future Video Generation ViT then takes the RGB tokens and auxiliary tokens as input and applies a cross-attention structure, with the RGB tokens serving as keys and values and the auxiliary tokens as queries, to obtain latent features representing future frames. Finally, the decoder reconstructs these latent features into images to generate future frames.
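As an illustrative sketch only (not the authors' implementation; the embedding dimension, token counts per modality, and number of repeated blocks are all assumptions), the Auxiliary Information Extraction ViT's concatenate, self-attend, split, per-modality-MLP cycle can be expressed with untrained NumPy projections:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32   # token embedding dimension (assumed)
N = 16   # tokens per modality (assumed, e.g. a 4x4 patch grid)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(tokens):
    """Single-head self-attention with random (untrained) projections."""
    Wq = rng.standard_normal((D, D)) / np.sqrt(D)
    Wk = rng.standard_normal((D, D)) / np.sqrt(D)
    Wv = rng.standard_normal((D, D)) / np.sqrt(D)
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    return softmax(Q @ K.T / np.sqrt(D)) @ V

def modality_mlp(tokens):
    """Two-layer ReLU MLP; a separate instance would be trained per modality."""
    W1 = rng.standard_normal((D, 2 * D)) / np.sqrt(D)
    W2 = rng.standard_normal((2 * D, D)) / np.sqrt(2 * D)
    return np.maximum(tokens @ W1, 0.0) @ W2

# Stand-ins for tokenized depth, segmentation, and optical-flow features.
depth_tok = rng.standard_normal((N, D))
seg_tok   = rng.standard_normal((N, D))
flow_tok  = rng.standard_normal((N, D))

tokens = np.concatenate([depth_tok, seg_tok, flow_tok], axis=0)  # (3N, D)
for _ in range(2):  # repeated blocks; depth of 2 is an assumption
    tokens = self_attention(tokens)        # joint cross-modal mixing
    parts = np.split(tokens, 3, axis=0)    # split back per modality
    tokens = np.concatenate([modality_mlp(p) for p in parts], axis=0)

aux_tokens = tokens  # (3N, D) auxiliary tokens for the generation ViT
print(aux_tokens.shape)  # (48, 32)
```

The split-then-concatenate pattern is what lets each modality keep its own MLP while the shared self-attention captures inter-modal relationships.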
To evaluate its effectiveness, we applied the proposed method to future video generation on videos recorded by a vehicle-mounted camera. Experiments were conducted on dynamic image sequences extracted at three-frame intervals from the public datasets Cityscapes (16.6 fps) and WayveScenes (10 fps), yielding effective frame rates of 5.6 fps and 3.3 fps, respectively. Of these, 1,400 scenes were used for training and 250 scenes for evaluation. Four consecutive frames were given as input to predict the subsequent frame. The mean absolute error between predicted and ground-truth images was 9.3 for the proposed method, compared to 11.3 for a baseline without auxiliary information. These results confirm that the proposed method, leveraging the VAE and multi-modal information, achieves higher prediction accuracy. However, the generated images exhibited blurred edges, suggesting difficulty in restoring high-frequency components. Future work will therefore focus on methods for predicting higher-resolution images. |
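For reference, the cross-attention step of the Future Video Generation ViT described in the abstract, with RGB tokens supplying keys and values and auxiliary tokens supplying queries, can be sketched as follows (untrained random weights; all dimensions and token counts are illustrative assumptions, and the resulting latent would then pass through the VAE decoder):

```python
import numpy as np

rng = np.random.default_rng(1)
D, N_RGB, N_AUX = 32, 16, 48  # illustrative dimensions (assumed)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Stand-ins for VAE-latent RGB tokens and the auxiliary tokens.
rgb_tokens = rng.standard_normal((N_RGB, D))
aux_tokens = rng.standard_normal((N_AUX, D))

Wq = rng.standard_normal((D, D)) / np.sqrt(D)
Wk = rng.standard_normal((D, D)) / np.sqrt(D)
Wv = rng.standard_normal((D, D)) / np.sqrt(D)

Q = aux_tokens @ Wq   # queries come from the auxiliary tokens
K = rgb_tokens @ Wk   # keys come from the RGB tokens
V = rgb_tokens @ Wv   # values come from the RGB tokens

# Each auxiliary token attends over the RGB tokens, producing
# latent features representing the future frame.
future_latent = softmax(Q @ K.T / np.sqrt(D)) @ V  # (N_AUX, D)
print(future_latent.shape)  # (48, 32)
```

Using auxiliary tokens as queries means the geometric and motion cues (depth, segmentation, optical flow) decide where to read appearance information from the RGB latent, which matches the role the abstract assigns to each token stream.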