[Paper Notes] 2019 - Multimodal Transformer for Unaligned Multimodal Language Sequences (work in progress)
Paper Overview
Original paper: Multimodal Transformer for Unaligned Multimodal Language Sequences
Paper: https://arxiv.org/abs/1906.00295
Code: https://github.com/yaohungt/Multimodal-Transformer
What follows are only my notes from reading the paper; my knowledge is limited, so corrections are welcome if anything is wrong.
Paper Content
Abstract
- Human language is often multimodal, which comprehends a mixture of natural language, facial gestures, and acoustic behaviors.
- However, two major challenges in modeling such multimodal human language time-series data exist: 1) inherent data non-alignment due to variable sampling rates for the sequences from each modality; and 2) long-range dependencies between elements across modalities.
- In this paper, we introduce the Multimodal Transformer (MulT) to generically address the above issues in an end-to-end manner without explicitly aligning the data.
- At the heart of our model is the directional pairwise crossmodal attention, which attends to interactions between multimodal sequences across distinct time steps and latently adapts streams from one modality to another (see the sketch after this list).
- Comprehensive experiments on both aligned and non-aligned multimodal time-series show that our model outperforms state-of-the-art methods by a large margin.
- In addition, empirical analysis suggests that correlated crossmodal signals are able to be captured by the proposed crossmodal attention mechanism in MulT.
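The key idea behind the directional pairwise crossmodal attention is that the target modality supplies the queries while the source modality supplies the keys and values, so the two sequences never need to share the same length or sampling rate. Below is a minimal sketch of this idea in PyTorch (the framework used in the released code); it is not the authors' implementation, and the class name `CrossmodalAttention`, the dimensions, and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossmodalAttention(nn.Module):
    """Attend from a target modality (queries) to a source modality (keys/values)."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        # nn.MultiheadAttention allows the query and key/value sequences to have
        # different lengths, which is what lets the two modalities stay unaligned.
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=num_heads)

    def forward(self, target, source):
        # target: (T_target, batch, dim), e.g. language features
        # source: (T_source, batch, dim), e.g. audio or vision features
        out, _ = self.attn(query=target, key=source, value=source)
        return out  # source information re-expressed at the target's time steps

# Usage: attend from language (50 steps) to vision (30 steps) without alignment.
lang = torch.randn(50, 2, 40)    # (seq_len, batch, dim)
vision = torch.randn(30, 2, 40)
layer = CrossmodalAttention(dim=40, num_heads=8)
fused = layer(target=lang, source=vision)
print(fused.shape)  # torch.Size([50, 2, 40])
```

The output keeps the target modality's time resolution, which is why MulT can fuse streams sampled at different rates without any explicit word-level alignment step.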
1 Introduction